Applying Pattern Mining to Web Information Extraction

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)


Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to automatically generate extractors. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered through a data structure called a PAT tree. In addition, incomplete patterns are further revised by pattern alignment to cover all pattern instances. This new approach to IE involves no human effort and no content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction over fourteen popular search engines.
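As a rough illustration of the repeated-pattern idea, the sketch below abstracts a page into a sequence of tag tokens and reports maximal repeated token sequences. It is not the paper's algorithm: the paper uses a PAT tree, whereas this sketch substitutes brute-force n-gram counting, which is only practical for small inputs. The function names `tokenize` and `repeated_patterns`, the `TEXT` placeholder token, and the `min_len`/`min_count` parameters are all assumptions made for this example, and the alignment step is not shown.

```python
import re
from collections import defaultdict


def tokenize(html):
    """Abstract a page into tag tokens; every non-blank text run becomes 'TEXT'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            # Keep only the tag name, e.g. '<a href="x">' -> '<a>'.
            m = re.match(r"</?\w+", tag)
            if m:
                tokens.append(m.group(0).lower() + ">")
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each maximal repeated token sequence to its start positions.

    Brute-force stand-in for a PAT tree (assumption): count every n-gram,
    keep those seen at least min_count times, then drop any pattern that
    can be extended by one token on either side without losing occurrences.
    """
    n = len(tokens)
    occurrences = defaultdict(list)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    frequent = {p: pos for p, pos in occurrences.items() if len(pos) >= min_count}

    absorbed = set()
    for longer, pos in frequent.items():
        for shorter in (longer[:-1], longer[1:]):
            if shorter in frequent and len(frequent[shorter]) == len(pos):
                absorbed.add(shorter)
    return {p: pos for p, pos in frequent.items() if p not in absorbed}


if __name__ == "__main__":
    page = "<ul><li><b>A</b></li><li><b>B</b></li><li><b>C</b></li></ul>"
    for pattern, positions in repeated_patterns(tokenize(page)).items():
        print(len(positions), " ".join(pattern), positions)
```

On this toy page the record unit `<li> <b> TEXT </b> </li>` is the only pattern reported with three occurrences, at token positions 1, 6, and 11. Longer patterns that span two records also appear with two occurrences; deciding which repeat is the actual record pattern would need further criteria that this sketch does not implement.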


References

  1. Chien, L.F. 1997. PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–58.

  2. Doorenbos, R.B.; Etzioni, O.; and Weld, D.S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pp. 39–48, New York, NY: ACM Press.

  3. Embley, D.; Jiang, Y.; and Ng, Y.-K. 1999. Record-boundary discovery in Web documents. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pp. 467–478, Philadelphia, Pennsylvania.

  4. Gonnet, G.H.; Baeza-Yates, R.A.; and Snider, T. 1992. New Indices for Text: PAT Trees and PAT Arrays. In Information Retrieval: Data Structures and Algorithms, Prentice Hall.

  5. Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge.

  6. Hsu, C.-N. and Dung, M.-T. 1998. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems 23(8):521–538.

  7. Knoblock, A. et al., ed. 1998. Proc. 1998 Workshop on AI and Information Integration, Menlo Park, California: AAAI Press.

  8. Kurtz, S. and Schleiermacher, C. 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15(5):426–427.

  9. Kushmerick, N.; Weld, D.; and Doorenbos, R. 1997. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI).

  10. Muslea, I.; Minton, S.; and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents’99), Seattle, WA.

  11. Muslea, I. 1999. Extraction patterns for information extraction tasks: a survey. In Proceedings of AAAI’99: Workshop on Machine Learning for Information Extraction.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, CH., Lui, SC., Wu, YC. (2001). Applying Pattern Mining to Web Information Extraction. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg.

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41910-5

  • Online ISBN: 978-3-540-45357-4

  • eBook Packages: Springer Book Archive
