Abstract
We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e. from Web pages that are all similar in structure and content. Traditional wrapper induction thus targets Web pages that are generated from a database using the same generation template as observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such Life Science Web sites do not provide data alone; they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report our preliminary results, which are very promising.
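The grouping of structurally similar pages into classes can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state which similarity measure or clustering method is used, so the sketch assumes Jaccard similarity over sets of root-to-element tag paths and a greedy single-pass clustering. The names `PathCollector`, `tag_paths`, `jaccard` and `cluster_pages`, and the threshold of 0.6, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Record the path but do not push: no end tag will follow.
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint of a page as a frozenset of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering (an assumption, not the paper's method).

    A page joins the first class whose first member is at least
    `threshold` similar to it; otherwise it starts a new class.
    """
    classes = []  # each class is a list of (url, paths) pairs
    for url, html in pages.items():
        paths = tag_paths(html)
        for members in classes:
            if jaccard(paths, members[0][1]) >= threshold:
                members.append((url, paths))
                break
        else:
            classes.append([(url, paths)])
    return [[url for url, _ in members] for members in classes]
```

Given two record pages built from the same table template and one search form page, `cluster_pages` would place the two record pages in one class and the form page in another; a wrapper could then be induced per class. Comparing only against the first member of each class keeps the sketch short but makes the result depend on page order, which a real system would likely avoid.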
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mir, S., Staab, S., Rojas, I. (2009). Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: Paton, N.W., Missier, P., Hedeler, C. (eds) Data Integration in the Life Sciences. DILS 2009. Lecture Notes in Computer Science, vol 5647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02879-3_9
DOI: https://doi.org/10.1007/978-3-642-02879-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02878-6
Online ISBN: 978-3-642-02879-3
eBook Packages: Computer Science (R0)