Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

Authors:

Wai Lam

ACM Transactions on Internet Technology (TOIT), Volume 7, Issue 1

Pages 6 - es

https://doi.org/10.1145/1189740.1189746

Published: 01 February 2007

Abstract

We develop a novel framework for automatically adapting previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. Two kinds of features related to the text fragments from the Web documents are investigated. The first type is called a site-invariant feature. These features are likely to remain unchanged in Web pages from different sites in the same domain. The second type is called a site-dependent feature. These features differ across Web pages collected from different Web sites but are similar across Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are then exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from just one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision of 71.9% and an average recall of 84.0% without any further manual intervention.
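The adaptation idea in the abstract can be illustrated with a small toy sketch. It is not the authors' algorithm: it assumes that token overlap with previously extracted items can stand in for a site-invariant feature, and that a layout label attached to each text fragment (the hypothetical `context` string) can stand in for a site-dependent feature. The function names, the threshold, and the data layout are all illustrative assumptions.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens of a text fragment."""
    return set(text.lower().split())


def invariant_score(fragment, source_items):
    """Best Jaccard token overlap between a fragment and any source item.

    Assumption: overlap with items already extracted from the source
    site serves as a simple stand-in for a site-invariant feature.
    """
    frag = tokens(fragment)
    best = 0.0
    for item in source_items:
        item_tokens = tokens(item)
        union = frag | item_tokens
        if union:
            best = max(best, len(frag & item_tokens) / len(union))
    return best


def adapt(source_items, target_fragments, threshold=0.5):
    """Return fragments of the target site judged to be items.

    target_fragments is a list of (context, text) pairs, where the
    hypothetical context string plays the role of a site-dependent
    feature such as the fragment's position in the page layout.
    """
    # Step 1: seek training examples in the target site using only
    # the site-invariant score.
    seeds = [(ctx, text) for ctx, text in target_fragments
             if invariant_score(text, source_items) >= threshold]
    if not seeds:
        return []
    # Step 2: learn the site-dependent pattern shared by the seeds
    # (here simply their most frequent context).
    best_ctx, _ = Counter(ctx for ctx, _ in seeds).most_common(1)[0]
    # Step 3: extract every fragment that matches the learned pattern.
    return [text for ctx, text in target_fragments if ctx == best_ctx]
```

In this sketch, a single fragment of the target site that resembles a known source item is enough to identify the layout context in which items appear, after which unseen items in the same context are extracted as well. A realistic system would use richer features and a trained classifier rather than a fixed threshold and a majority vote.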

Index Terms

Adapting Web information extraction knowledge via mining site-invariant and site-dependent features
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Inductive logic learning
2. Information systems
  1. Information retrieval
    1. Document representation

Published In

ACM Transactions on Internet Technology Volume 7, Issue 1

February 2007

184 pages

ISSN: 1533-5399

EISSN: 1557-6051

DOI: 10.1145/1189740

Copyright © 2007 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Total Citations: 18
Total Downloads: 1,615
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Mar 2025

Cited By

Kong J, Barkol O, Bergman R, Pnueli A, Schein S, Zhang K, Zhao C (2018). Web Interface Interpretation Using Graph Grammars. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42:4 (590-602). Online publication date: 25-Dec-2018.
https://dl.acm.org/doi/10.1109/TSMCC.2011.2171335
Wong T, Lam W (2018). Learning to extract and summarize hot item features from multiple auction web sites. Knowledge and Information Systems 14:2 (143-160). Online publication date: 29-Dec-2018.
https://dl.acm.org/doi/10.1007/s10115-007-0078-2
Ku C, Leroy G (2018). A crime reports analysis system to identify related crimes. Journal of the American Society for Information Science and Technology 62:8 (1533-1547). Online publication date: 28-Dec-2018.
https://dl.acm.org/doi/10.1002/asi.21552
Bing L, Wong T, Lam W (2016). Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews. ACM Transactions on Internet Technology 16:2 (1-17). Online publication date: 15-Apr-2016.
https://dl.acm.org/doi/10.1145/2857054
Wu S, Wang Q (2013). An Adaptive Web Information Extraction Approach Based on STU-DOM Tree. Applied Mechanics and Materials 397-400 (1972-1978). Online publication date: Sep-2013.
https://doi.org/10.4028/www.scientific.net/AMM.397-400.1972
Geraci F, Maggini M (2013). A Fast Method for Web Template Extraction via a Multi-sequence Alignment Approach. Knowledge Discovery, Knowledge Engineering and Knowledge Management (172-184). Online publication date: 2013.
https://doi.org/10.1007/978-3-642-37186-8_11
Ferrez R, de Groc C, Couto J (2013). Mining Product Features from the Web: A Self-supervised Approach. Web Information Systems and Technologies (296-311). Online publication date: 2013.
https://doi.org/10.1007/978-3-642-36608-6_19
Nandhi kesavan R, Latha K (2012). Lexical semantic based Bayesian model for adaptive wrapper generation. 2012 International Conference on Data Science & Engineering (ICDSE) (19-22). Online publication date: Jul-2012.
https://doi.org/10.1109/ICDSE.2012.6281907
kesavan R, Latha K (2012). Lexical Semantic based Bayesian Model for Adaptive Wrapper Generation. Procedia Engineering 38 (3343-3350). Online publication date: 2012.
https://doi.org/10.1016/j.proeng.2012.06.387
Huang S, Zheng X, Wang X, Chen D (2011). News information extraction based on adaptive weighting using unsupervised Bayesian algorithm. Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II (251-258). Online publication date: 24-Sep-2011.
https://dl.acm.org/doi/10.5555/2045753.2045791
