Approximately Repetitive Structure Detection for Wrapper Induction

Gao, Xiaoying; Andreae, Peter; Collins, Richard

doi:10.1007/978-3-540-28633-2_62

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3157)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

In recent years, much work has been invested in automatically learning wrappers for information extraction from HTML tables and lists. Our research has focused on a system that can learn a wrapper from a single unlabelled page. An essential step is to locate the tabular data within the page, which is not trivial when the structures of the data tuples are similar but not identical. In this paper we describe an algorithm that automatically detects approximately repetitive structures within a single sequence. The algorithm does not rely on any domain knowledge or HTML heuristics, and it can be used to detect repetitive patterns and hence to learn wrappers from a single unlabelled tabular page.
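
The abstract does not spell out the algorithm, so the following is only an illustrative sketch and should not be read as the authors' method. It assumes the page has already been tokenised into a list of tag and text tokens, and it substitutes a standard Smith-Waterman-style local alignment of the sequence against itself to show what "similar but not identical" tuples look like to a repeat detector. The function name `best_approximate_repeat` and the scoring values are hypothetical choices made for this example.

```python
from typing import List, Tuple


def best_approximate_repeat(
    tokens: List[str], match: int = 2, mismatch: int = -1, gap: int = -1
) -> Tuple[int, int, int]:
    """Return (score, end_i, end_j) for the best local alignment of the
    sequence against a later part of itself (1-based end positions, end_i < end_j).

    Illustrative assumption: plain local alignment restricted to the upper
    triangle. The two aligned regions are allowed to overlap.
    """
    n = len(tokens)
    # h[i][j] is the best score of a local alignment ending at
    # tokens[i-1] and tokens[j-1]; cells with j <= i are left at 0.
    h = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):  # compare each position only with later ones
            s = match if tokens[i - 1] == tokens[j - 1] else mismatch
            h[i][j] = max(
                0,                    # start a new alignment
                h[i - 1][j - 1] + s,  # align the two tokens
                h[i - 1][j] + gap,    # token missing from the later copy
                h[i][j - 1] + gap,    # extra token in the later copy
            )
            if h[i][j] > best[0]:
                best = (h[i][j], i, j)
    return best


# Two table rows with similar but not identical structure:
# the second row wraps its text in <b>...</b>.
page = ["<tr>", "<td>", "text", "</td>", "</tr>",
        "<tr>", "<td>", "<b>", "text", "</b>", "</td>", "</tr>"]
result = best_approximate_repeat(page)
print(result)  # (8, 5, 12): five matches (+10) and two gaps (-2)
```

In this toy input the two rows align despite the extra `<b>` and `</b>` tokens, which is the kind of approximate repetition the abstract refers to. A practical detector would also need to recover the aligned regions and handle more than two repeats, which this sketch does not attempt.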

Author information

Authors and Affiliations

School of Mathematical and Computing Sciences, Victoria University of Wellington, Wellington, New Zealand
Xiaoying Gao, Peter Andreae & Richard Collins

Editor information

Editors and Affiliations

Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang
Department of Computer Science, Auckland University of Technology, 1020, Auckland, New Zealand
Hans W. Guesgen
Artificial Intelligence Technology Centre, Auckland University of Technology, Auckland, New Zealand
Wai-Kiang Yeap

About this paper

Cite this paper

Gao, X., Andreae, P., Collins, R. (2004). Approximately Repetitive Structure Detection for Wrapper Induction. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds) PRICAI 2004: Trends in Artificial Intelligence. PRICAI 2004. Lecture Notes in Computer Science, vol 3157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28633-2_62

DOI: https://doi.org/10.1007/978-3-540-28633-2_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22817-2
Online ISBN: 978-3-540-28633-2
eBook Packages: Springer Book Archive