Learning to adapt cross language information extraction wrapper

Applied Intelligence

Abstract

We propose a framework for adapting a previously learned wrapper from a source Web site to unseen sites in different languages. To achieve this, we exploit the information extraction knowledge previously learned from the source Web site, together with the items previously extracted or collected from it. This knowledge and data are automatically translated into the language of the unseen sites using online Web resources such as online Web dictionaries or maps. Site-independent features, which capture the characteristics of the content of the data, are then derived from the translated information. Several text mining methods are employed to automatically discover a set of machine-labeled training examples in the unseen site. Both the content-oriented features and the site-dependent features of these machine-labeled training examples are then used to learn the wrapper for the new unseen site with our language-independent wrapper induction component. We conducted experiments on real-world Web sites in different languages to demonstrate the effectiveness of our framework.
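The translate-then-label idea in the abstract can be illustrated with a minimal sketch. It is not the article's method: the abstract does not name the text mining methods or features used, so this sketch assumes a toy in-memory dictionary in place of an online Web dictionary and a simple token-overlap (Jaccard) score in place of the content-oriented features. The names `TOY_DICTIONARY`, `translate`, `token_overlap`, and `machine_label` are hypothetical.

```python
# Hypothetical stand-in for an online Web dictionary (English -> French).
TOY_DICTIONARY = {"red": "rouge", "white": "blanc", "wine": "vin"}


def translate(item, dictionary):
    """Translate word by word; words missing from the dictionary are kept."""
    return " ".join(dictionary.get(word, word) for word in item.lower().split())


def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets (ignores word order)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def machine_label(candidates, translated_items, threshold=0.8):
    """Keep text fragments from the unseen site that closely match a
    translated source item; these act as machine-labeled examples."""
    labeled = []
    for text in candidates:
        best = max((token_overlap(text, t) for t in translated_items), default=0.0)
        if best >= threshold:
            labeled.append((text, best))
    return labeled


# Items previously extracted from the (assumed English) source site.
source_items = ["red wine", "white wine"]
translated = [translate(item, TOY_DICTIONARY) for item in source_items]

# Text fragments found on the (assumed French) unseen site.
fragments = ["Vin rouge", "Livraison gratuite", "Vin blanc"]
for text, score in machine_label(fragments, translated):
    print(text, score)
```

In this sketch the order-insensitive overlap score is a deliberate choice, because word-by-word translation does not preserve word order across languages. The fragments that pass the threshold would then serve as training examples for a wrapper learner, which is outside the scope of this illustration.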



Author information

Corresponding author

Correspondence to Tak-Lam Wong.

About this article

Cite this article

Wong, TL. Learning to adapt cross language information extraction wrapper. Appl Intell 36, 918–931 (2012). https://doi.org/10.1007/s10489-011-0305-0

  • DOI: https://doi.org/10.1007/s10489-011-0305-0
