Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming

  • Conference paper
Advances in Web Intelligence (AWIC 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3528)

Abstract

This paper presents an approach for applying inductive logic programming to information extraction from HTML documents structured as unranked ordered trees. We consider information extraction from Web resources that are abstracted as providing sets of tuples. Our approach is based on defining a new class of wrappers as a special class of logic programs – logic wrappers. The approach is demonstrated with examples and experimental results in the area of collecting product information, highlighting the advantages and the limitations of the method.
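To make the setting concrete, the following is a minimal sketch of what extracting tuples from an HTML document viewed as an unranked ordered tree can look like. It is not the authors' implementation: the `Node` class, the `parse` and `extract` functions, the sample table, and the predicate names in the comment (`tag`, `child`, `next`, `text`) are all illustrative assumptions. The rule is written by hand here, whereas the paper learns such rules with inductive logic programming.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A node of an unranked ordered tree: any number of ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds the tree; assumes well-formed HTML without void elements."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)  # child order is preserved
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def descendants(node: Node) -> Iterator[Node]:
    """Yields all descendants of a node in document (pre-order) order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def extract(root: Node) -> Iterator[Tuple[str, str]]:
    # Hand-written rule (hypothetical), readable as the clause:
    #   extract(Name, Price) :- tag(R, tr), child(R, A), child(R, B),
    #       next(A, B), tag(A, td), tag(B, td), text(A, Name), text(B, Price).
    for row in descendants(root):
        if row.tag != "tr":
            continue
        for a, b in zip(row.children, row.children[1:]):  # adjacent siblings
            if a.tag == "td" and b.tag == "td":
                yield (a.text, b.text)


# Hypothetical product listing used only for illustration.
SAMPLE = """
<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Printer A</td><td>120</td></tr>
<tr><td>Printer B</td><td>95</td></tr>
</table>
"""

if __name__ == "__main__":
    for name, price in extract(parse(SAMPLE)):
        print(name, price)
```

On the sample table, the rule skips the header row (its cells are `th`, not `td`) and yields one (name, price) tuple per product row. Rows with more than two `td` cells would yield one tuple per adjacent pair, so a real wrapper would need additional conditions to select the intended columns.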

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bădică, C., Bădică, A., Popescu, E. (2005). Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds) Advances in Web Intelligence. AWIC 2005. Lecture Notes in Computer Science, vol 3528. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11495772_8

  • DOI: https://doi.org/10.1007/11495772_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26219-0

  • Online ISBN: 978-3-540-31900-9

  • eBook Packages: Computer Science, Computer Science (R0)
