Approximately Repetitive Structure Detection for Wrapper Induction

  • Conference paper
PRICAI 2004: Trends in Artificial Intelligence (PRICAI 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3157)

Abstract

In recent years, much work has gone into automatically learning wrappers for information extraction from HTML tables and lists. Our research focuses on a system that can learn a wrapper from a single unlabelled page. An essential step is locating the tabular data within the page, which is not trivial when the data tuples have similar but not identical structures. In this paper we describe an algorithm that automatically detects approximately repetitive structures within a single sequence. The algorithm does not rely on any domain knowledge or HTML heuristics, and it can be used to detect repetitive patterns and hence to learn wrappers from a single unlabelled tabular page.
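To make the idea of approximate repetition within one sequence concrete, the following is a minimal sketch and not the algorithm described in the paper. It assumes the page is reduced to a sequence of tag names, and it uses a standard Smith-Waterman-style local alignment of that sequence against itself, restricted to cells with i < j so that no position is aligned with itself. The function names `tokenize` and `best_self_alignment` and the scoring parameters are hypothetical choices made for illustration.

```python
import re


def tokenize(html):
    """Reduce an HTML string to a list of tag names (text content is dropped).

    A deliberately crude tokenizer, assumed here only for illustration.
    """
    return re.findall(r"<(/?\w+)[^>]*>", html)


def best_self_alignment(seq, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman-style local alignment of seq against itself.

    Only cells with i < j are filled, so a position is never aligned
    with itself. Returns (score, end_i, end_j), where end_i and end_j
    are the 1-based end positions of the two matching segments, or
    (0, 0, 0) if no two positions match.
    """
    n = len(seq)
    # H[i][j] holds the best score of a local alignment ending at (i, j).
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align seq[i-1] with seq[j-1]
                H[i - 1][j] + gap,    # skip a token in the first segment
                H[i][j - 1] + gap,    # skip a token in the second segment
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


if __name__ == "__main__":
    page = (
        "<table>"
        "<tr><td>Ann</td><td>1</td></tr>"
        "<tr><td><b>Bob</b></td><td>2</td></tr>"
        "</table>"
    )
    print(best_self_alignment(tokenize(page)))
```

A high score means that two stretches of the sequence are similar even when one of them contains extra tokens, such as the `<b>` element in the second row above. The quadratic table is acceptable for a single page but would need pruning for very long sequences.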

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gao, X., Andreae, P., Collins, R. (2004). Approximately Repetitive Structure Detection for Wrapper Induction. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds) PRICAI 2004: Trends in Artificial Intelligence. PRICAI 2004. Lecture Notes in Computer Science (LNAI), vol 3157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28633-2_62

  • DOI: https://doi.org/10.1007/978-3-540-28633-2_62

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22817-2

  • Online ISBN: 978-3-540-28633-2

  • eBook Packages: Springer Book Archive
