
H\(\imath\)LεX: A System for Semantic Information Extraction from Web Documents

  • Conference paper
Enterprise Information Systems (ICEIS 2006)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 3)


Abstract

Recognizing and extracting meaningful information from unstructured Web documents, taking their semantics into account, is an important problem in information and knowledge management. This paper describes H\(\imath\)LεX, a system implementing a novel logic-based approach to information extraction from unstructured documents. The approach is founded on a new two-dimensional representation of documents and relies heavily on DLP+, an extension of disjunctive logic programming for ontology representation and reasoning that has recently been implemented on top of the DLV reasoning environment. Unlike previous systems, which are mainly syntactic, H\(\imath\)LεX combines semantic and syntactic knowledge for powerful information extraction. Ontologies, representing the semantics of the information to be extracted, are encoded in DLP+, while the extraction patterns are expressed using regular expressions and an ad hoc two-dimensional grammar. Executing the DLP+ reasoning modules that encode the grammar expressions yields the actual extraction of information from the input document. H\(\imath\)LεX supports semantic information extraction from both HTML pages and flat text documents by means of concise and very expressive extraction patterns.




Editor information

Yannis Manolopoulos, Joaquim Filipe, Panos Constantopoulos, José Cordeiro


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ruffolo, M., Manna, M. (2008). H\(\imath\)LεX: A System for Semantic Information Extraction from Web Documents. In: Manolopoulos, Y., Filipe, J., Constantopoulos, P., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2006. Lecture Notes in Business Information Processing, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77581-2_13


  • DOI: https://doi.org/10.1007/978-3-540-77581-2_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77580-5

  • Online ISBN: 978-3-540-77581-2

  • eBook Packages: Computer Science, Computer Science (R0)
