Chapter 6: Web Data Extraction for Service Creation

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5950)

Abstract

Web data extraction is an enabling technique in the search computing scenario. In this chapter, we first review the state of the art in wrapper technologies, focusing on how wrapper generators can be used to create unified services that integrate data from Web applications and Web services in various domains. Next, we describe the Lixto approach and present the Lixto Suite as one example of Web Process Integration. Finally, we discuss application areas, future challenges, and the use of wrapper technologies in the search computing context.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Baumgartner, R., Campi, A., Gottlob, G., Herzog, M. (2010). Chapter 6: Web Data Extraction for Service Creation. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 5950. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12310-8_6

  • DOI: https://doi.org/10.1007/978-3-642-12310-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12309-2

  • Online ISBN: 978-3-642-12310-8

  • eBook Packages: Computer Science, Computer Science (R0)
