Automatic Generation of Data Types for Classification of Deep Web Sources

Ngu, Anne H. H.; Buttler, David; Critchlow, Terence

doi:10.1007/11530084_21


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3615)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences


Abstract

A Service Class Description (SCD) is an effective metadata-based approach for discovering Deep Web sources whose data exhibit regular patterns. However, creating an SCD manually is tedious and error-prone. Moreover, a manually created SCD does not adapt to the frequent changes of Web sources, because it requires its creator to identify all the possible input and output types of a service a priori. In many domains, it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning the data types of a class of Web sources. The Brute-Force Learner generates data types that achieve high recall, but with low precision. The Clustering-based Learner generates data types that achieve high precision, but with lower recall. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and present a quantitative evaluation of both.
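The precision and recall trade-off mentioned in the abstract can be made concrete with a small sketch. It does not reproduce either learner from the paper. It assumes, purely for illustration, that a data type can be modelled as a regular expression, and the tokens, the "year" type, and the function name `precision_recall` are all hypothetical.

```python
import re


def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted items
    measured against the set of truly relevant items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical strings found on a citation page, and the ones that
# really are publication years (made-up data, not from the paper).
tokens = ["1999", "2004", "44", "Boston", "3615"]
true_years = {"1999", "2004"}

# A loose candidate type (any run of digits) accepts every year,
# but also accepts page and volume numbers.
loose = {t for t in tokens if re.fullmatch(r"\d+", t)}

# A tight candidate type (four digits starting with 20) accepts
# only real years here, but misses the year 1999.
tight = {t for t in tokens if re.fullmatch(r"20\d{2}", t)}

loose_precision, loose_recall = precision_recall(loose, true_years)
tight_precision, tight_recall = precision_recall(tight, true_years)

print(f"loose: precision={loose_precision:.2f} recall={loose_recall:.2f}")
print(f"tight: precision={tight_precision:.2f} recall={tight_recall:.2f}")
```

In this toy setting the loose pattern behaves like a high-recall, low-precision type, and the tight pattern like a high-precision, lower-recall type, which mirrors the contrast the abstract draws between the two learners.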

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. UCRL-CONF-209719.



Author information

Authors and Affiliations

Department of Computer Science, Texas State University, San Marcos, TX, 78666, USA
Anne H. H. Ngu
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, 94551, USA
David Buttler & Terence Critchlow


Editor information

Editors and Affiliations

Department of Computer Science, University of California, Davis
Bertram Ludäscher
University of Maryland, College Park, 20742, MD, USA
Louiqa Raschid


About this paper

Cite this paper

Ngu, A.H.H., Buttler, D., Critchlow, T. (2005). Automatic Generation of Data Types for Classification of Deep Web Sources. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_21


DOI: https://doi.org/10.1007/11530084_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27967-9
Online ISBN: 978-3-540-31879-8
eBook Packages: Computer Science, Computer Science (R0)
