H$\imath$LεX: A System for Semantic Information Extraction from Web Documents

Ruffolo, Massimo; Manna, Marco

doi:10.1007/978-3-540-77581-2_13


Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 3)

Included in the following conference series:

International Conference on Enterprise Information Systems

Abstract

Recognizing and extracting meaningful information from unstructured Web documents, taking their semantics into account, is an important problem in information and knowledge management. This paper describes H$\imath$LεX, a system implementing a novel logic-based approach to information extraction from unstructured documents. The approach is founded on a new two-dimensional representation of documents and relies heavily on DLP⁺, an extension of disjunctive logic programming for ontology representation and reasoning that has recently been implemented on top of the DLV reasoning environment. Unlike previous systems, which are mainly syntactic, H$\imath$LεX combines semantic and syntactic knowledge to make information extraction more powerful. Ontologies, which represent the semantics of the information to be extracted, are encoded in DLP⁺, while extraction patterns are expressed using regular expressions and an ad hoc two-dimensional grammar. Executing the DLP⁺ reasoning modules that encode the grammar expressions yields the actual extraction of information from the input document. H$\imath$LεX supports semantic information extraction from both HTML pages and flat text documents using concise and highly expressive extraction patterns.
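As a rough illustration of what combining regular expressions with a two-dimensional view of a document can look like, the following Python sketch places the tokens of a flat text on a grid and extracts an amount only when it stands in a given spatial relation to a label. This is not the H$\imath$LεX implementation and does not use DLP⁺ or DLV: the `Portion` class, the `CONCEPT_PATTERNS` table, the concept names, and the "right of or directly below" rule are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """A token plus its position on the page (hypothetical structure)."""
    text: str
    row: int  # line number in the flat text
    col: int  # character offset of the token's first character


# Hypothetical stand-in for an ontology: concept name -> recognizing regex.
CONCEPT_PATTERNS = {
    "price_label": re.compile(r"(?i)^price:?$"),
    "amount": re.compile(r"^\d+\.\d{2}$"),
}


def to_portions(text):
    """Map flat text onto a grid: one row per line, column = char offset."""
    portions = []
    for row, line in enumerate(text.splitlines()):
        for match in re.finditer(r"\S+", line):
            portions.append(Portion(match.group(), row, match.start()))
    return portions


def concept_of(portion):
    """Return the first concept whose regex matches the token, else None."""
    for name, pattern in CONCEPT_PATTERNS.items():
        if pattern.match(portion.text):
            return name
    return None


def extract_prices(text):
    """Return amounts located right of, or directly below, a price label."""
    portions = to_portions(text)
    labels = [p for p in portions if concept_of(p) == "price_label"]
    amounts = [p for p in portions if concept_of(p) == "amount"]
    found = []
    for label in labels:
        for amount in amounts:
            same_row_right = amount.row == label.row and amount.col > label.col
            directly_below = (amount.row == label.row + 1
                              and amount.col == label.col)
            if same_row_right or directly_below:
                found.append(amount.text)
    return found
```

In this sketch, `extract_prices("Item Price: 12.50\nShipping 3.00")` picks up `12.50` but not `3.00`, because only the first amount satisfies the spatial rule relative to the label. A purely syntactic regex over the raw string would match both numbers, which is the kind of gap the positional information is meant to close.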



Author information

Authors and Affiliations

Massimo Ruffolo: Exeura s.r.l.; ICAR-CNR, Institute of High Performance Computing and Networking of the Italian National Research Council
Marco Manna: Department of Mathematics, University of Calabria, 87036 Arcavacata di Rende (CS), Italy


Editor information

Yannis Manolopoulos, Joaquim Filipe, Panos Constantopoulos, José Cordeiro


About this paper

Cite this paper

Ruffolo, M., Manna, M. (2008). H$\imath$LεX: A System for Semantic Information Extraction from Web Documents. In: Manolopoulos, Y., Filipe, J., Constantopoulos, P., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2006. Lecture Notes in Business Information Processing, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77581-2_13


DOI: https://doi.org/10.1007/978-3-540-77581-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77580-5
Online ISBN: 978-3-540-77581-2
eBook Packages: Computer Science (R0)
