Abstract
Recognizing and extracting meaningful information from unstructured Web documents, taking their semantics into account, is an important problem in information and knowledge management. This paper describes HıLεX, a system implementing a novel logic-based approach to information extraction from unstructured documents. The approach is founded on a new two-dimensional representation of documents and relies heavily on DLP+, an extension of disjunctive logic programming for ontology representation and reasoning that has recently been implemented on top of the DLV reasoning environment. Unlike previous systems, which are mainly syntactic, HıLεX combines semantic and syntactic knowledge during extraction. Ontologies, which represent the semantics of the information to be extracted, are encoded in DLP+, while extraction patterns are expressed using regular expressions and an ad hoc two-dimensional grammar. Executing the DLP+ reasoning modules that encode the grammar expressions yields the actual extraction of information from the input document. HıLεX supports semantic information extraction from both HTML pages and flat text documents by means of concise and highly expressive extraction patterns.
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Ruffolo, M., Manna, M. (2008). HıLεX: A System for Semantic Information Extraction from Web Documents. In: Manolopoulos, Y., Filipe, J., Constantopoulos, P., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2006. Lecture Notes in Business Information Processing, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77581-2_13
DOI: https://doi.org/10.1007/978-3-540-77581-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77580-5
Online ISBN: 978-3-540-77581-2
eBook Packages: Computer Science (R0)