A Constrained Clustering Approach to Duplicate Detection Among Relational Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Abstract

This paper proposes an approach to detecting duplicates among relational data. Traditional methods for record linkage or duplicate detection work on a set of records that have no explicit relations with each other. These records can be formatted into a single database table for processing. However, there are situations in which records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. Duplicate detection for such relational data records/instances can be handled by formatting them into several tables and applying traditional methods to each table. However, because the relations among the original data records are ignored, this approach generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a constrained clustering approach to duplicate detection. The approach incorporates constraint rules derived from the characteristics of relational data and therefore yields better and more consistent results, as our experiments show.
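
The abstract does not spell out the algorithm or the constraint rules, so the following is only a minimal sketch of the general idea of clustering under constraints, not the authors' method. It assumes a single hypothetical rule (two records known to be distinct, for example two entries from the same source, must never share a cluster), and the names `similarity`, `constrained_clusters`, `cannot_link`, and `threshold` are illustrative choices that do not come from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any record comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def constrained_clusters(records, cannot_link, threshold=0.8):
    """Greedily merge similar records, refusing any merge that would place
    a cannot-link pair in the same cluster (hypothetical constraint rule).

    records:     dict mapping record id -> text
    cannot_link: set of frozenset({id_a, id_b}) pairs that must stay apart
    """
    # Every record starts in its own cluster; merged ids share one set object.
    cluster_of = {rid: {rid} for rid in records}

    # Visit candidate pairs from most to least similar.
    pairs = sorted(
        combinations(records, 2),
        key=lambda p: similarity(records[p[0]], records[p[1]]),
        reverse=True,
    )
    for a, b in pairs:
        if similarity(records[a], records[b]) < threshold:
            break  # all remaining pairs are even less similar
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        # Constraint check: no member of ca may be forbidden from any member of cb.
        if any(frozenset((x, y)) in cannot_link for x in ca for y in cb):
            continue
        merged = ca | cb
        for rid in merged:
            cluster_of[rid] = merged

    # Collect each distinct cluster once and return a stable, sorted view.
    unique = []
    for cluster in cluster_of.values():
        if not any(cluster is seen for seen in unique):
            unique.append(cluster)
    return sorted(sorted(c) for c in unique)


if __name__ == "__main__":
    # Toy data: ids prefixed "a" and "b" come from two hypothetical sources.
    records = {
        "a1": "John Smith",
        "a2": "Jon Smith",
        "b1": "John Smith",
        "b2": "Jane Doe",
    }
    # Assumed rule: a1 and a2 are distinct entries of the same source.
    rules = {frozenset(("a1", "a2"))}
    print(constrained_clusters(records, rules))
    print(constrained_clusters(records, set()))
```

On this toy data the cannot-link pair keeps `a2` out of the cluster formed by `a1` and `b1`, whereas the unconstrained run merges all three, which is the kind of inconsistency the abstract attributes to ignoring relations among records.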

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Wang, C., Lu, J., Zhang, G. (2007). A Constrained Clustering Approach to Duplicate Detection Among Relational Data. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_31

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science, Computer Science (R0)
