Abstract
Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes to the web also mean that wrappers need to be refactored constantly. Machine learning has proven useful, but current techniques are limited in expressivity, require non-intuitive user interaction, or do not allow n-ary extraction. We study the use of tree-patterns as an n-ary extraction language and propose an algorithm that learns such queries. It computes the most information-conservative tree-pattern that generalizes two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, the proposed approach requires no labeling other than the data that the user actually wants to extract. The reported experiments show the effectiveness of the approach.
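To make the idea of generalizing two input trees concrete, here is a minimal sketch in Python. It is not the algorithm proposed in the paper: it is a naive positional generalization that handles only child relationships, turns differing labels into a wildcard, and drops unmatched trailing children. The names `Node`, `WILDCARD`, `generalize`, and `show` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List

WILDCARD = "*"  # hypothetical marker for "any label"


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def generalize(a: Node, b: Node) -> Node:
    """Return a pattern that matches both trees (child edges only).

    Labels that differ become a wildcard. Children are paired by
    position; children without a partner in the other tree are dropped.
    """
    label = a.label if a.label == b.label else WILDCARD
    kids = [generalize(x, y) for x, y in zip(a.children, b.children)]
    return Node(label, kids)


def show(n: Node) -> str:
    """Render a tree or pattern as label(child, child, ...)."""
    if not n.children:
        return n.label
    return n.label + "(" + ", ".join(show(c) for c in n.children) + ")"


# Two table rows that differ in one inner tag and in their length.
t1 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")])])
t2 = Node("tr", [Node("td", [Node("i")]), Node("td", [Node("a")]), Node("td")])

print(show(generalize(t1, t2)))  # tr(td(*), td(a))
```

A learner of the kind the abstract describes would additionally need to decide when a child edge should be relaxed into a descendant edge and how to keep as much shared information as possible; this sketch leaves both out.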
An erratum to this chapter can be found at http://dx.doi.org/10.1007/11914853_71.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Habegger, B., Debarbieux, D. (2006). Integrating Data from the Web by Machine-Learning Tree-Pattern Queries. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE. OTM 2006. Lecture Notes in Computer Science, vol 4275. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11914853_59
DOI: https://doi.org/10.1007/11914853_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48287-1
Online ISBN: 978-3-540-48289-5
eBook Packages: Computer Science, Computer Science (R0)