
Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

Published: 01 February 2007

Abstract

We develop a novel framework that aims to automatically adapt previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. Two kinds of features related to the text fragments in Web documents are investigated. The first kind is called a site-invariant feature. These features are likely to remain unchanged across Web pages from different sites in the same domain. The second kind is called a site-dependent feature. These features differ across Web pages collected from different Web sites but are similar among Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are then exploited to automatically seek a new set of training examples in the new, unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from just one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
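
The adaptation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature definitions, the function names, the agreement-based similarity, the 0.75 threshold, and the representation of a target fragment as a (layout context, text) pair are all assumptions made purely for illustration. The sketch uses content-only cues as stand-ins for site-invariant features, selects target fragments that resemble items from the source site, and then reads off the most common layout context as a stand-in for a site-dependent feature.

```python
import re
from collections import Counter


def invariant_features(text):
    """Hypothetical site-invariant features: content-only cues of a fragment."""
    tokens = text.split()
    capitalized = sum(t[:1].isupper() for t in tokens)
    return {
        "has_currency": bool(re.search(r"\$\s?\d", text)),
        "has_digit": any(ch.isdigit() for ch in text),
        "capitalized_ratio": round(capitalized / max(len(tokens), 1), 1),
        "length_bucket": min(len(tokens) // 3, 3),
    }


def similarity(a, b):
    """Fraction of features on which two feature dictionaries agree."""
    return sum(a[k] == b[k] for k in a) / len(a)


def seek_training_examples(source_items, target_fragments, threshold=0.75):
    """Select target fragments whose invariant features resemble a source item.

    target_fragments is a list of (layout_context, text) pairs; the layout
    context string is an assumed stand-in for site-specific formatting.
    """
    source_feats = [invariant_features(s) for s in source_items]
    picked = []
    for context, text in target_fragments:
        feats = invariant_features(text)
        if max(similarity(feats, s) for s in source_feats) >= threshold:
            picked.append((context, text))
    return picked


def learn_site_dependent_context(examples):
    """Return the most common layout context among the selected examples."""
    return Counter(context for context, _ in examples).most_common(1)[0][0]


# Items previously extracted from the source site (here: book prices).
source_items = ["$12.99", "$7.50"]

# Fragments from an unseen target site, paired with their layout context.
target_fragments = [
    ("td.price", "$23.00"),
    ("h2.title", "Learning Python Programming"),
    ("td.price", "$5.25"),
    ("div.nav", "Home"),
]

examples = seek_training_examples(source_items, target_fragments)
price_context = learn_site_dependent_context(examples)
```

In this toy run, only the two price-like fragments are selected as training examples, and the layout context they share (`td.price`) is what a learner for the target site could then rely on. A real system would need far richer features and a proper learning algorithm in place of the majority vote.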

