
Efficient Duplicate Record Detection Based on Similarity Estimation

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Abstract

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it needs to compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to determine whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
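
As a rough illustration of the matching-based similarity described in the abstract, the following Python sketch scores two records by the weight of a maximum-weight bipartite matching between their attribute values. It is not the authors' implementation: the function names, the use of `difflib.SequenceMatcher` as the attribute-level similarity, the normalisation by the larger record's attribute count, and the brute-force search over assignments are all assumptions made for this example. The O(1) similarity estimation itself is not shown, since the abstract does not specify it.

```python
from difflib import SequenceMatcher
from itertools import permutations


def attribute_similarity(a: str, b: str) -> float:
    """Hypothetical attribute-level similarity in [0, 1] (an assumption,
    not the measure used in the paper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, sim=attribute_similarity) -> float:
    """Weight of the best one-to-one matching between the attribute values
    of two records, divided by the larger attribute count (assumed
    normalisation). Brute force, so only suitable for a few attributes."""
    small, large = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    if not large:
        return 0.0
    # weights[i][j] is the similarity of small[i] and large[j].
    weights = [[sim(a, b) for b in large] for a in small]
    # Try every injective assignment of small's attributes to large's.
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(large)), len(small))
    )
    return best / len(large)


if __name__ == "__main__":
    # Same entity, different attribute order and capitalisation.
    r1 = ["John Smith", "Boston", "1980"]
    r2 = ["1980", "john smith", "boston"]
    print(record_similarity(r1, r2))
```

Because the matching ignores attribute order and names, the two records in the usage example are scored as identical even though their schemas differ. For records with many attributes, a polynomial-time assignment algorithm such as the Hungarian method would replace the brute-force search.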

Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Post doctor Foundation of China (No. 20090450126), Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, M., Wang, H., Li, J., Gao, H. (2010). Efficient Duplicate Record Detection Based on Similarity Estimation. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_58

  • DOI: https://doi.org/10.1007/978-3-642-14246-8_58

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14245-1

  • Online ISBN: 978-3-642-14246-8

  • eBook Packages: Computer Science, Computer Science (R0)
