Learning to adapt cross language information extraction wrapper

Wong, Tak-Lam

doi:10.1007/s10489-011-0305-0

Published: 15 June 2011

Volume 36, pages 918–931, (2012)
Applied Intelligence

Tak-Lam Wong¹

Abstract

We propose a framework for adapting a previously learned wrapper from a source Web site to unseen sites in different languages. To achieve this, we exploit the previously learned information extraction knowledge and the items previously extracted or collected from the source Web site. This knowledge and data are automatically translated into the language of the unseen sites via online Web resources such as online Web dictionaries or maps. Site-independent features, which capture the characteristics of the content of the data, are then derived from the translated information. Several text mining methods are employed to automatically discover a set of machine-labeled training examples in the unseen site. Both the content-oriented features and the site-dependent features of the machine-labeled training examples are then used to learn a wrapper for the unseen site with our language-independent wrapper induction component. We conducted experiments on real-world Web sites in different languages to demonstrate the effectiveness of our framework.
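
The step of discovering machine-labeled training examples can be illustrated with a minimal sketch. It is not the method of the article, whose text mining methods are not specified in this abstract. It assumes a toy in-memory dictionary in place of an online Web dictionary, Spanish as a hypothetical target language, and character-bigram Dice similarity as a stand-in content-oriented measure. The names `TOY_DICTIONARY`, `translate_item`, `dice_similarity`, `machine_label`, and the threshold value are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary standing in for an online Web dictionary.
TOY_DICTIONARY: Dict[str, str] = {
    "red": "rojo",
    "wine": "vino",
    "white": "blanco",
}


def translate_item(item: str, dictionary: Dict[str, str]) -> str:
    """Translate word by word; words missing from the dictionary are kept."""
    return " ".join(dictionary.get(w, w) for w in item.lower().split())


def char_bigrams(text: str) -> Set[str]:
    """Set of character bigrams of the lowercased text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (an assumed content feature)."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def machine_label(
    candidates: List[str],
    source_items: List[str],
    dictionary: Dict[str, str],
    threshold: float = 0.6,  # assumed cut-off, not taken from the article
) -> List[Tuple[str, float]]:
    """Keep candidate text fragments from the unseen site that resemble
    at least one translated source item."""
    translated = [translate_item(i, dictionary) for i in source_items]
    labeled = []
    for c in candidates:
        score = max((dice_similarity(c, t) for t in translated), default=0.0)
        if score >= threshold:
            labeled.append((c, score))
    return labeled


if __name__ == "__main__":
    source = ["red wine", "white wine"]            # items from the source site
    unseen = ["Vino rojo", "vino blanco", "Contacto"]  # fragments from the unseen site
    for text, score in machine_label(unseen, source, TOY_DICTIONARY):
        print(f"{text}: {score:.3f}")
```

Character bigrams are used here because they tolerate the word-order differences that word-by-word translation produces. In a fuller pipeline, the fragments kept by `machine_label` would serve as training examples for wrapper induction on the unseen site.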

Author information

Authors and Affiliations

Department of Mathematics and Information Technology, The Hong Kong Institute of Education, 10 Lo Ping Road, N.T., Tai Po, Hong Kong
Tak-Lam Wong

Corresponding author

Correspondence to Tak-Lam Wong.

About this article

Cite this article

Wong, TL. Learning to adapt cross language information extraction wrapper. Appl Intell 36, 918–931 (2012). https://doi.org/10.1007/s10489-011-0305-0

Published: 15 June 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10489-011-0305-0
