Abstract
A Service Class Description (SCD) is an effective metadata-based approach for discovering Deep Web sources whose data exhibit regular patterns. However, creating an SCD manually is tedious and error-prone. Moreover, a manually created SCD does not adapt to the frequent changes of Web sources. It requires its creator to identify all possible input and output types of a service a priori. In many domains, it is impossible to list exhaustively all possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatically generating the data types of an SCD. We propose two approaches for learning the data types of a class of Web sources. The Brute-Force Learner generates data types that achieve high recall but low precision. The Clustering-based Learner generates data types that achieve high precision but lower recall. We demonstrate the feasibility of these two learning-based solutions for automatically generating data types for citation Web sources and present a quantitative evaluation of both.
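The precision and recall trade-off between the two learners can be illustrated with a minimal sketch. The function name `precision_recall`, the set names, and all of the data below are hypothetical and invented for illustration only; they are not the paper's learners, data types, or results. The sketch only shows how a broad set of learned types can reach full recall at low precision, while a narrow set does the opposite.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a predicted set against a reference set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical reference data types for a citation source (illustration only).
gold = {"author", "title", "year", "venue"}

# A broad learner proposes many candidate types, including spurious ones.
broad_learner = {"author", "title", "year", "venue",
                 "menu", "advert", "footer", "login"}

# A narrow learner proposes few types, all of them correct.
narrow_learner = {"author", "title"}

print(precision_recall(broad_learner, gold))   # (0.5, 1.0)
print(precision_recall(narrow_learner, gold))  # (1.0, 0.5)
```

In this toy setting the broad set recovers every reference type but half of its proposals are wrong, whereas the narrow set is always right but misses half of the reference types.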
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. UCRL-CONF-209719.
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ngu, A.H.H., Buttler, D., Critchlow, T. (2005). Automatic Generation of Data Types for Classification of Deep Web Sources. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_21
DOI: https://doi.org/10.1007/11530084_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27967-9
Online ISBN: 978-3-540-31879-8
eBook Packages: Computer Science (R0)