Automatic Generation of Data Types for Classification of Deep Web Sources

  • Conference paper
Data Integration in the Life Sciences (DILS 2005)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3615)

Abstract

A Service Class Description (SCD) is an effective metadata-based approach for discovering Deep Web sources whose data exhibit regular patterns. However, creating an SCD manually is tedious and error-prone. Moreover, a manually created SCD does not adapt to the frequent changes of Web sources, and it requires its creator to identify all the possible input and output types of a service a priori. In many domains, it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning the data types of a class of Web sources. The Brute-Force Learner generates data types that achieve high recall, but with low precision. The Clustering-based Learner generates data types that achieve high precision, but with lower recall. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and present a quantitative evaluation of both.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. UCRL-CONF-209719.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ngu, A.H.H., Buttler, D., Critchlow, T. (2005). Automatic Generation of Data Types for Classification of Deep Web Sources. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_21

  • DOI: https://doi.org/10.1007/11530084_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-27967-9

  • Online ISBN: 978-3-540-31879-8

  • eBook Packages: Computer Science, Computer Science (R0)
