Abstract
The World Wide Web is now undeniably the richest and densest source of information, yet its structure makes that information difficult to use in a systematic way. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that extract structured data from semi-structured Web documents. IEPAD was proposed to automate wrapper generation from a multiple-record Web page without user-labeled examples. In this paper, we consider the complementary case, in which multiple Web pages are available but each input page contains only one record (called singular Web pages). For this case, we propose a hierarchical multiple string alignment that allows wrapper induction from multiple singular Web pages.
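To make the idea of aligning singular pages concrete, the following is a minimal sketch, not the hierarchical multiple string alignment of the paper. It assumes each page has already been tokenized into tags and text strings, and it aligns only two pages, using a longest-common-subsequence table. Tokens shared by both pages are treated as template, and unmatched tokens as candidate data fields. The names `align`, `slots`, `page1`, and `page2` are hypothetical.

```python
from typing import List, Optional, Tuple

Token = str
GAP = None  # marks a position that is present in only one page


def align(a: List[Token], b: List[Token]) -> List[Tuple[Optional[Token], Optional[Token]]]:
    """Globally align two token strings using a longest-common-subsequence table."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start, emitting one aligned column per step.
    columns: List[Tuple[Optional[Token], Optional[Token]]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            columns.append((a[i], b[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            columns.append((a[i], GAP))
            i += 1
        else:
            columns.append((GAP, b[j]))
            j += 1
    columns.extend((tok, GAP) for tok in a[i:])
    columns.extend((GAP, tok) for tok in b[j:])
    return columns


def slots(columns: List[Tuple[Optional[Token], Optional[Token]]]) -> Tuple[List[Token], List[Token]]:
    """Return, for each page, the tokens that fall outside the shared template."""
    page_a = [x for x, y in columns if x != y and x is not GAP]
    page_b = [y for x, y in columns if x != y and y is not GAP]
    return page_a, page_b


# Two hypothetical singular pages, one record each, already tokenized.
page1 = ["<html>", "<b>", "Dune", "</b>", "<i>", "$9", "</i>", "</html>"]
page2 = ["<html>", "<b>", "Emma", "</b>", "<i>", "$7", "</i>", "</html>"]

print(slots(align(page1, page2)))  # (['Dune', '$9'], ['Emma', '$7'])
```

A pairwise alignment like this only hints at the approach; aligning many pages at once, and doing so hierarchically, is the contribution the abstract describes.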
References
C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, pages 223–231, Hong Kong, 2001.
R.B. Doorenbos, O. Etzioni, and D.S. Weld. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the 1st International Conference on Autonomous Agents, pages 39–48, New York, USA, 1997.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge, 1997.
C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents, pages 190–197, Seattle, WA, 1999.
E. Selberg and O. Etzioni. Multi-engine search and comparison using the MetaCrawler. In Proceedings of the 4th International World Wide Web Conference, Boston, USA, 1995.
N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, Japan, 1997.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, CH., Kuo, SC., Hwang, KY., Ho, TH., Lin, CL. (2002). Automatic Information Extraction for Multiple Singular Web Pages. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_29
DOI: https://doi.org/10.1007/3-540-47887-6_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive