A Novel Clustering-Based Approach to Schema Matching

Pei, Jin; Hong, Jun; Bell, David

doi:10.1007/11890393_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4243)

Included in the following conference series:

International Conference on Advances in Information Systems

Abstract

Schema matching is a critical step in integrating data from multiple heterogeneous data sources. This paper presents a new approach to schema matching, based on two observations. First, it is easier to find attribute correspondences between schemas that are contextually similar. Second, the attribute correspondences found between these schemas can be used to help find new attribute correspondences between other schemas. Motivated by these observations, we propose a novel clustering-based approach to schema matching. First, we cluster schemas on the basis of their contextual similarity. Second, we cluster the attributes of the schemas within each schema cluster to find attribute correspondences between these schemas. Third, we cluster attributes across different schema clusters, using statistical information gleaned from the existing attribute clusters, to find attribute correspondences between more schemas. We apply a fast clustering algorithm, K-Means, to all three clustering tasks. We have evaluated our approach in the context of integrating information from multiple web interfaces, and the results show the effectiveness of our approach.
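
The first two clustering steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy schemas, the bag-of-words vectors, the cosine-based K-Means with deterministic farthest-first seeding, and the hand-picked values of `k` are all assumptions made for the example. The third step (clustering attributes across schema clusters using statistics from existing attribute clusters) is not shown, since the abstract does not specify it.

```python
import math
from collections import Counter, defaultdict

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def vectorize(texts):
    """Unit-length term-frequency vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [normalize([float(Counter(t.lower().split())[w]) for w in vocab])
            for t in texts]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(vecs, k, iters=10):
    """K-Means on unit vectors using cosine similarity.

    Assumption: seeds are chosen farthest-first, starting from the first
    vector, so that this toy example is deterministic.
    """
    centroids = [vecs[0]]
    while len(centroids) < k:
        centroids.append(
            min(vecs, key=lambda v: max(dot(v, c) for c in centroids)))
    labels = [0] * len(vecs)
    for _ in range(iters):
        # Assign each vector to its most similar centroid.
        labels = [max(range(k), key=lambda j: dot(v, centroids[j]))
                  for v in vecs]
        # Recompute each centroid as the normalized mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == j]
            if members:
                centroids[j] = normalize(
                    [sum(col) / len(members) for col in zip(*members)])
    return labels

def group(items, labels):
    """Collect items that share a cluster label, in first-seen order."""
    clusters = defaultdict(list)
    for item, label in zip(items, labels):
        clusters[label].append(item)
    return list(clusters.values())

# Hypothetical query-interface schemas, each a list of attribute labels.
schemas = {
    "books_a": ["title", "author", "isbn"],
    "books_b": ["book title", "author name", "isbn"],
    "air_a": ["departure city", "arrival city", "departure date"],
    "air_b": ["from city", "to city", "departure date"],
}

# Step 1: cluster whole schemas by the words their attributes contain.
names = list(schemas)
schema_vecs = vectorize([" ".join(schemas[n]) for n in names])
schema_clusters = group(names, kmeans(schema_vecs, k=2))

# Step 2: within one schema cluster, cluster the attributes themselves.
book_attrs = [(n, a) for n in schema_clusters[0] for a in schemas[n]]
attr_vecs = vectorize([a for _, a in book_attrs])
attr_clusters = group(book_attrs, kmeans(attr_vecs, k=3))

print(schema_clusters)
print(attr_clusters)
```

On this toy input the two book schemas and the two airline schemas fall into separate schema clusters, and within the book cluster each attribute cluster pairs an attribute of `books_a` with its counterpart in `books_b`. A practical system would need a principled way to choose `k` and a richer notion of contextual similarity than shared words.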

Author information

Authors and Affiliations

School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, BT7 1NN, UK
Jin Pei, Jun Hong & David Bell

Editor information

Editors and Affiliations

Dokuz Eylül University, İzmir, Turkey
Tatyana Yakhno
University of Vienna, Vienna, Austria
Erich J. Neuhold

About this paper

Cite this paper

Pei, J., Hong, J., Bell, D. (2006). A Novel Clustering-Based Approach to Schema Matching. In: Yakhno, T., Neuhold, E.J. (eds) Advances in Information Systems. ADVIS 2006. Lecture Notes in Computer Science, vol 4243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11890393_7

DOI: https://doi.org/10.1007/11890393_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46291-0
Online ISBN: 978-3-540-46292-7
eBook Packages: Computer Science (R0)