Chapter 6: Web Data Extraction for Service Creation

Baumgartner, Robert; Campi, Alessandro; Gottlob, Georg; Herzog, Marcus

doi:10.1007/978-3-642-12310-8_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5950)

Abstract

Web data extraction is an enabling technique in the search computing scenario. In this chapter, we first review the state of the art in wrapper technologies, focusing on how wrapper generators can be used to create unified services that integrate data from Web applications and Web services in various domains. Next, we describe the Lixto approach and present the Lixto Suite as one example of Web Process Integration. Finally, we discuss application areas, future challenges, and the use of wrapper technologies in the search computing context.

Author information

Authors and Affiliations

Lixto Software GmbH, Favoritenstrasse 9-11, 1040, Wien, Austria
Robert Baumgartner & Marcus Herzog
Politecnico di Milano, DEI, Piazza Leonardo da Vinci 32, 20133, Milano, Italy
Alessandro Campi
Computing Laboratory, Oxford University, U.K.
Georg Gottlob

Editor information

Editors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza L. Da Vinci, 32, I20133, Milano, Italy
Stefano Ceri & Marco Brambilla

About this chapter

Cite this chapter

Baumgartner, R., Campi, A., Gottlob, G., Herzog, M. (2010). Chapter 6: Web Data Extraction for Service Creation. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 5950. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12310-8_6

DOI: https://doi.org/10.1007/978-3-642-12310-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12309-2
Online ISBN: 978-3-642-12310-8
eBook Packages: Computer Science, Computer Science (R0)
