L-wrappers: concepts, properties and construction

A declarative approach to data extraction from web sources

  • Focus
Soft Computing

Abstract

In this paper, we propose a novel class of wrappers, logic wrappers (L-wrappers), inspired by the logic programming paradigm. L-wrappers have declarative semantics, and therefore (i) their specification is decoupled from their implementation and (ii) they can be generated using inductive logic programming. We also define a convenient mapping of L-wrappers to XSLT for efficient processing with available XSLT processing engines.
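To illustrate what a declarative extraction rule over a document tree might look like, the following minimal Python sketch evaluates a single hypothetical rule against a small table. The relation names (`child`, `next`, `tag`), the rule itself and the sample document are assumptions made for illustration only; they are not the formal definitions given in the article, and the sketch does not show the mapping to XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the article).
DOC = """
<table>
  <tr><td>Printer</td><td>120</td></tr>
  <tr><td>Scanner</td><td>85</td></tr>
</table>
""".strip()


def relations(root):
    """Build two tree relations as lists of node pairs, in document order.

    child: (parent, node) pairs
    nxt:   (node, immediately following sibling) pairs
    """
    child, nxt = [], []
    for parent in root.iter():
        kids = list(parent)
        child.extend((parent, k) for k in kids)
        nxt.extend(zip(kids, kids[1:]))
    return child, nxt


def extract(root):
    """Evaluate one illustrative rule by naive enumeration:

    extract(A, B) :- tag(R, tr), child(R, A), child(R, B),
                     next(A, B), tag(A, td), tag(B, td).
    """
    child, nxt = relations(root)
    tuples = []
    for row in root.iter("tr"):                      # tag(R, tr)
        for parent, a in child:                      # child(R, A)
            if parent is not row or a.tag != "td":   # tag(A, td)
                continue
            for left, b in nxt:                      # next(A, B)
                if left is a and b.tag == "td" and (row, b) in child:
                    tuples.append((a.text, b.text))
    return tuples


if __name__ == "__main__":
    print(extract(ET.fromstring(DOC)))
```

The rule states only which structural conditions an extracted pair must satisfy; the enumeration strategy in `extract` is one possible implementation and could be replaced without changing the rule.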

Author information

Corresponding author

Correspondence to Costin Bădică.

About this article

Cite this article

Bădică, C., Bădică, A., Popescu, E. et al. L-wrappers: concepts, properties and construction. Soft Comput 11, 753–772 (2007). https://doi.org/10.1007/s00500-006-0118-y

  • DOI: https://doi.org/10.1007/s00500-006-0118-y
