Integrating Data from the Web by Machine-Learning Tree-Pattern Queries

  • Conference paper
On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE (OTM 2006)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4275))

Abstract

Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes on the web also imply that wrappers need to be constantly refactored. Machine learning has proven useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow n-ary extraction. We study the use of tree patterns as an n-ary extraction language and propose an algorithm for learning such queries. It computes the most information-conservative tree pattern that generalizes two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, the proposed approach requires no labeling other than the data the user actually wants to extract. The reported experiments show the effectiveness of the approach.
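
To make the idea of generalizing two input trees concrete, here is a minimal Python sketch. It is not the algorithm proposed in the paper: it aligns children purely by position, replaces mismatched labels with a wildcard, and does not produce descendant edges. The names `Node`, `WILDCARD`, `generalize` and `to_text` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List

WILDCARD = "*"  # assumed notation for "any label"; not taken from the paper


@dataclass
class Node:
    """An ordered, labeled tree node (for example, a simplified DOM element)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def generalize(a: Node, b: Node) -> Node:
    """Return a pattern that matches the shared top-down shape of a and b.

    Assumption: children are aligned by position and any surplus children
    of the longer list are dropped. A real learner would need a smarter
    alignment and would also introduce descendant edges.
    """
    label = a.label if a.label == b.label else WILDCARD
    kids = [generalize(x, y) for x, y in zip(a.children, b.children)]
    return Node(label, kids)


def to_text(n: Node) -> str:
    """Render a tree or pattern as label(child, child, ...)."""
    if not n.children:
        return n.label
    return n.label + "(" + ", ".join(to_text(c) for c in n.children) + ")"


# Two table rows that differ in the markup of their second cell.
row1 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("i")])])
row2 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")]), Node("td")])

print(to_text(generalize(row1, row2)))
```

The resulting pattern keeps what both rows share (a `tr` with two `td` cells, the first containing `b`) and abstracts away what differs, which is the intuition behind a generalization that conserves as much information as possible.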

An erratum to this chapter can be found at http://dx.doi.org/10.1007/11914853_71.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Habegger, B., Debarbieux, D. (2006). Integrating Data from the Web by Machine-Learning Tree-Pattern Queries. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE. OTM 2006. Lecture Notes in Computer Science, vol 4275. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11914853_59

  • DOI: https://doi.org/10.1007/11914853_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-48287-1

  • Online ISBN: 978-3-540-48289-5

  • eBook Packages: Computer Science, Computer Science (R0)
