A Constrained Clustering Approach to Duplicate Detection Among Relational Data

Wang, Chao; Lu, Jie; Zhang, Guangquan

doi:10.1007/978-3-540-71701-0_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper proposes an approach to detecting duplicates among relational data. Traditional methods for record linkage or duplicate detection operate on a set of records that have no explicit relations with each other, so the records can be formatted into a single database table for processing. In some situations, however, records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. Duplicate detection for such relational data records/instances can be handled by formatting them into several tables and applying traditional methods to each table, but because this ignores the relations among the original data records, it generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a clustering approach to duplicate detection. The approach incorporates constraint rules derived from the characteristics of relational data and, as our experiments show, therefore yields better and more consistent results.
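
The abstract does not spell out the constraint rules or the clustering algorithm, so the following is only a generic sketch of the idea, not the authors' method. It assumes one plausible kind of rule: two records that a source explicitly relates as distinct entities (for example, co-authors of the same paper) must never end up in the same duplicate cluster. The function names `jaccard` and `constrained_clusters`, the token-level similarity, the threshold of 0.5, and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def constrained_clusters(records, cannot_link=(), threshold=0.5):
    """Greedy single-link clustering that respects cannot-link pairs.

    records:     dict mapping record id -> text (hypothetical format)
    cannot_link: iterable of (id, id) pairs that must stay in different
                 clusters (an assumed form of relational constraint)
    """
    cluster_of = {rid: {rid} for rid in records}
    forbidden = {frozenset(pair) for pair in cannot_link}

    # Consider candidate pairs from most to least similar.
    pairs = sorted(
        ((jaccard(records[a], records[b]), a, b)
         for a, b in combinations(sorted(records), 2)),
        reverse=True,
    )
    for sim, a, b in pairs:
        if sim < threshold:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        # Skip the merge if it would join any forbidden pair.
        if any(frozenset((x, y)) in forbidden for x in ca for y in cb):
            continue
        merged = ca | cb
        for rid in merged:
            cluster_of[rid] = merged

    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


# Hypothetical records: s1_r1 and s1_r2 come from the same source, which
# relates them as two distinct people; s2_r1 comes from another source.
records = {
    "s1_r1": "chris lee sydney",
    "s1_r2": "chris lee melbourne",
    "s2_r1": "chris lee sydney nsw",
}

unconstrained = constrained_clusters(records)
constrained = constrained_clusters(records, cannot_link=[("s1_r1", "s1_r2")])
print(unconstrained)  # [['s1_r1', 's1_r2', 's2_r1']]
print(constrained)    # [['s1_r1', 's2_r1'], ['s1_r2']]
```

In this toy example, similarity alone chains all three records into one cluster, which contradicts the source's own statement that s1_r1 and s1_r2 are different people. With the cannot-link pair supplied, the cross-source duplicate is still found while the two related records stay apart.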

Author information

Authors and Affiliations

Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW 2007, Australia
Chao Wang, Jie Lu & Guangquan Zhang

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

About this paper

Cite this paper

Wang, C., Lu, J., Zhang, G. (2007). A Constrained Clustering Approach to Duplicate Detection Among Relational Data. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_31

DOI: https://doi.org/10.1007/978-3-540-71701-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)
