Abstract
Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work on wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered using a data structure called a PAT tree. Incomplete patterns are then revised by pattern alignment so that they cover all pattern instances. This new approach to IE requires neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction over fourteen popular search engines.
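The idea of repeated pattern mining can be illustrated with a minimal sketch. It is not the authors' implementation: a naively sorted list of suffixes stands in for the PAT tree, the page is assumed to be already tokenized into tags, and the names `lcp_len`, `repeated_patterns`, `min_len`, and `min_count` are hypothetical. The sketch reports token sequences that repeat at least `min_count` times and cannot be extended by one token on either side without losing occurrences.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each maximal repeated token pattern to its sorted start positions."""
    # Naive suffix ordering (assumption: stands in for a PAT tree).
    order = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    # Suffixes sharing a prefix are adjacent in sorted order, so comparing
    # neighbours is enough to collect every occurrence of every repeat.
    occurrences = {}
    for i, j in zip(order, order[1:]):
        shared = lcp_len(tokens[i:], tokens[j:])
        for length in range(min_len, shared + 1):
            pattern = tuple(tokens[i:i + length])
            occurrences.setdefault(pattern, set()).update((i, j))

    frequent = {p: sorted(s) for p, s in occurrences.items()
                if len(s) >= min_count}

    def extendable(p):
        # True if a one-token-longer pattern occurs just as often.
        return any(
            len(q) == len(p) + 1
            and len(pos) == len(frequent[p])
            and (q[:-1] == p or q[1:] == p)
            for q, pos in frequent.items()
        )

    return {p: pos for p, pos in frequent.items() if not extendable(p)}


# Hypothetical result page: three records inside a list.
record = ["<li>", "<a>", "TEXT", "</a>", "</li>"]
tokens = ["<ul>"] + record * 3 + ["</ul>"]
patterns = repeated_patterns(tokens, min_count=3)
```

Here `patterns` holds a single entry, the five-token record, with start positions 1, 6, and 11. The quadratic cost of this sketch is acceptable only for short token sequences; an index such as a PAT tree or suffix tree is what makes the search practical on real pages.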
References
Chien, L.F. 1997. PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–58.
Doorenbos, R.B.; Etzioni, O.; and Weld, D.S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pp. 39–48, New York, NY: ACM Press.
Embley, D.; Jiang, Y.; and Ng, Y.-K. 1999. Record-boundary discovery in Web documents. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pp. 467–478, Philadelphia, Pennsylvania.
Gonnet, G.H.; Baeza-Yates, R.A.; and Snider, T. 1992. New Indices for Text: Pat Trees and Pat Arrays. Information Retrieval: Data Structures and Algorithms, Prentice Hall.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
Hsu, C.-N. and Dung, M.-T. 1998. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems 23(8):521–538.
Knoblock, A. et al., eds. 1998. Proc. 1998 Workshop on AI and Information Integration. Menlo Park, California: AAAI Press.
Kurtz, S. and Schleiermacher, C. 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15(5):426–427.
Kushmerick, N.; Weld, D.; and Doorenbos, R. 1997. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI).
Muslea, I.; Minton, S.; and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents’99), Seattle, WA.
Muslea, I. 1999. Extraction patterns for information extraction tasks: a survey. In Proceedings of AAAI’99: Workshop on Machine Learning for Information Extraction.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, CH., Lui, SC., Wu, YC. (2001). Applying Pattern Mining to Web Information Extraction. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_4
DOI: https://doi.org/10.1007/3-540-45357-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive