Abstract
We propose a framework for adapting a previously learned wrapper from a source Web site to unseen sites in different languages. To achieve this, we exploit the information extraction knowledge previously learned from the source Web site and the items previously extracted or collected from it. This knowledge and data are automatically translated into the language of the unseen sites via online Web resources such as online Web dictionaries or maps. Site-independent features, which capture the characteristics of the content of the data, are then derived from the translated information. Several text mining methods are employed to automatically discover a set of machine-labeled training examples in the unseen site. Both the content-oriented features and the site-dependent features of these machine-labeled training examples are used to learn the wrapper for the unseen site with our language-independent wrapper induction component. We conducted experiments on real-world Web sites in different languages to demonstrate the effectiveness of our framework.
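The translate-then-match idea summarized above can be illustrated with a minimal sketch. It is not the framework's actual implementation: the toy dictionary, the function names, the character-level similarity measure, and the 0.8 threshold are all assumptions introduced here for illustration, standing in for the online Web dictionaries and text mining methods the abstract mentions.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for an online Web dictionary (assumption, not from the paper).
TOY_DICTIONARY = {
    "book": "libro",
    "history": "historia",
    "garden": "jardín",
    "red": "rojo",
}


def translate_item(item, dictionary):
    """Translate an item word by word, leaving unknown words unchanged."""
    return " ".join(dictionary.get(word, word) for word in item.lower().split())


def content_similarity(a, b):
    """A simple site-independent content feature: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def machine_label(candidates, translated_items, threshold=0.8):
    """Keep text fragments from the unseen site that closely match a translated item."""
    labeled = []
    for text in candidates:
        best = max(content_similarity(text, item) for item in translated_items)
        if best >= threshold:
            labeled.append((text, best))
    return labeled


# Items previously extracted from the source site (illustrative data).
source_items = ["History book", "Red garden"]
translated = [translate_item(item, TOY_DICTIONARY) for item in source_items]

# Text fragments found on a page of the unseen site (illustrative data).
candidates = ["Historia libro", "Contacto", "rojo jardín", "Ayuda"]
labeled = machine_label(candidates, translated)
```

In this sketch, the fragments that survive the threshold would play the role of machine-labeled training examples; a real system would combine such content features with site-dependent layout features before inducing the new wrapper.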
About this article
Cite this article
Wong, TL. Learning to adapt cross language information extraction wrapper. Appl Intell 36, 918–931 (2012). https://doi.org/10.1007/s10489-011-0305-0
DOI: https://doi.org/10.1007/s10489-011-0305-0