The Lixto Project: Exploring New Frontiers of Web Data Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4042)

Abstract

The Lixto project is an ongoing research effort in the area of Web data extraction. The project originally set out to develop a logic-based extraction language and a tool for visually defining extraction programs from sample Web pages, but its scope has been extended over time. Today, new issues are being investigated: employing learning algorithms to define extraction programs, automatically extracting data from Web pages with a table-centric visual appearance, and extracting data from alternative document formats such as PDF.

This work is funded in part by the Austrian Federal Ministry for Transport, Innovation and Technology under the FIT-IT Semantic Systems program.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Carme, J. et al. (2006). The Lixto Project: Exploring New Frontiers of Web Data Extraction. In: Bell, D.A., Hong, J. (eds) Flexible and Efficient Information Handling. BNCOD 2006. Lecture Notes in Computer Science, vol 4042. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11788911_1

  • DOI: https://doi.org/10.1007/11788911_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35969-2

  • Online ISBN: 978-3-540-35971-5

  • eBook Packages: Computer Science, Computer Science (R0)
