Site-Wide Wrapper Induction for Life Science Deep Web Databases

Mir, Saqib; Staab, Steffen; Rojas, Isabel

doi:10.1007/978-3-642-02879-3_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5647)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

Abstract

We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e. pages that are all similar in structure and content. Traditional wrapper induction therefore targets Web pages generated from a database with the same generation template as the one observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such Life Science Web sites do not provide data alone; they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval from across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report our preliminary results, which are very promising.
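
The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that pages can be grouped into classes by the Jaccard similarity of their tag-path sets, and that text which stays constant across all pages of a class acts as a data label while the text that follows it is the data value. All names (`cluster_pages`, `induce_wrapper`, `extract`) and the 0.6 threshold are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the innermost matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy grouping (assumption): a page joins the first class whose
    representative tag-path set is similar enough, else starts a new class."""
    classes = []  # list of (representative path set, member indices)
    for idx, html in enumerate(pages):
        paths = {path for path, _ in parse(html)}
        for representative, members in classes:
            if jaccard(paths, representative) >= threshold:
                members.append(idx)
                break
        else:
            classes.append((paths, [idx]))
    return [members for _, members in classes]


def induce_wrapper(pages):
    """Label heuristic (assumption): (path, text) pairs present on every
    page of the class are treated as labels. Needs at least two pages."""
    return set.intersection(*(set(parse(page)) for page in pages))


def extract(wrapper, html):
    """Pairs each label with the first non-label text that follows it."""
    record, current = {}, None
    for item in parse(html):
        if item in wrapper:
            current = item[1]
        elif current is not None:
            record[current] = item[1]
            current = None
    return record
```

To use the sketch, one would pass a list of HTML strings to `cluster_pages`, call `induce_wrapper` on the pages of each resulting class, and then apply `extract` to any page of that class. The heuristic is deliberately crude: a data value that happens to be identical on every sampled page is mistaken for a label, and a class with a single page yields no data at all.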



Author information

Authors and Affiliations

EML Research, Schloss-Wolfsbrunnenweg 33, D-69118, Heidelberg, Germany
Saqib Mir & Isabel Rojas
Institute for Computer Science, University of Koblenz-Landau, D-56016, Koblenz, Germany
Saqib Mir & Steffen Staab

Editor information

Editors and Affiliations

School of Computer Science, University of Manchester, M13 9PL, Manchester, UK
Norman W. Paton & Paolo Missier
School of Computer Science, The University of Manchester, Oxford Road, M13 9PL, Manchester, UK
Cornelia Hedeler

About this paper

Cite this paper

Mir, S., Staab, S., Rojas, I. (2009). Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: Paton, N.W., Missier, P., Hedeler, C. (eds) Data Integration in the Life Sciences. DILS 2009. Lecture Notes in Computer Science, vol 5647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02879-3_9

DOI: https://doi.org/10.1007/978-3-642-02879-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02878-6
Online ISBN: 978-3-642-02879-3
eBook Packages: Computer Science; Computer Science (R0)