Abstract
We develop a novel framework that aims to automatically adapt previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. We investigate two kinds of features of the text fragments in Web documents. The first kind, site-invariant features, are likely to remain unchanged across Web pages from different sites in the same domain. The second kind, site-dependent features, differ between pages collected from different Web sites but are similar among pages originating from the same site. In our framework, we derive the site-invariant features from the previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. These derived site-invariant features are then exploited to automatically seek a new set of training examples in the unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from only one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
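The adaptation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes that plain string similarity to items from the source site stands in for the site-invariant features, and that the enclosing HTML tag path stands in for the site-dependent features. All function names, the threshold, and the sample page are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser


class FragmentParser(HTMLParser):
    """Collect (enclosing tag path, text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(("/".join(self.stack), text))


def parse_fragments(page_html):
    parser = FragmentParser()
    parser.feed(page_html)
    parser.close()
    return parser.fragments


def invariant_score(text, known_items):
    """Assumed site-invariant cue: best similarity to a source-site item."""
    return max(
        SequenceMatcher(None, text.lower(), item.lower()).ratio()
        for item in known_items
    )


def find_training_examples(page_html, known_items, threshold=0.8):
    """Seek fragments in the target page that resemble known items."""
    return [
        (path, text)
        for path, text in parse_fragments(page_html)
        if invariant_score(text, known_items) >= threshold
    ]


def learn_site_dependent_path(examples):
    """Assumed site-dependent cue: most common tag path of the examples."""
    paths = Counter(path for path, _ in examples)
    return paths.most_common(1)[0][0] if paths else None


def extract(page_html, path):
    """Apply the learned tag path to extract every matching fragment."""
    return [text for p, text in parse_fragments(page_html) if p == path]


if __name__ == "__main__":
    # Hypothetical target page and items known from a source site.
    target_page = (
        "<html><body>"
        "<div><b>Algorithms on Strings, Trees, and Sequences</b><i>$75.00</i></div>"
        "<div><b>The Nature of Statistical Learning Theory</b><i>$89.99</i></div>"
        "<div><b>An Unseen Book Title</b><i>$40.00</i></div>"
        "</body></html>"
    )
    known_titles = [
        "Algorithms on Strings, Trees and Sequences",
        "The Nature of Statistical Learning Theory",
    ]
    examples = find_training_examples(target_page, known_titles)
    path = learn_site_dependent_path(examples)
    print(extract(target_page, path))
```

In this sketch, the two titles already known from the source site act as automatically discovered training examples, and the tag path they share lets the extractor also pick up a title it has never seen. A realistic system would need far richer features and a learned model in place of a single similarity threshold.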
Index Terms
- Adapting Web information extraction knowledge via mining site-invariant and site-dependent features