Efficient Duplicate Record Detection Based on Similarity Estimation

Li, Mohan; Wang, Hongzhi; Li, Jianzhong; Gao, Hong

doi:10.1007/978-3-642-14246-8_58


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Included in the following conference series:

International Conference on Web-Age Information Management


Abstract

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it needs to compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to decide whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
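The matching-based similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute similarity `attr_sim` (built on `difflib.SequenceMatcher`), the normalisation by the larger record size, the brute-force search over matchings, and the helper `naive_duplicates` are all assumptions made for illustration. The sketch shows only the intuitive pairwise baseline, not the O(1) estimation step, whose details the abstract does not give.

```python
from difflib import SequenceMatcher
from itertools import combinations, permutations


def attr_sim(a: str, b: str) -> float:
    """Hypothetical attribute similarity in [0, 1]; not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2) -> float:
    """Weight of a maximum-weight bipartite matching between the attribute
    values of two records, divided by the larger record size (an assumed
    normalisation). Brute force over matchings, so only for few attributes."""
    small, large = sorted((list(r1), list(r2)), key=len)
    if not small:
        return 0.0
    best = max(
        sum(attr_sim(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


def naive_duplicates(records, threshold: float):
    """Intuitive baseline: compare all n(n-1)/2 record pairs."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if record_similarity(a, b) >= threshold
    ]


records = [
    ["John Smith", "Boston", "1975"],
    ["1975", "John Smith", "Boston"],   # same values, different attribute order
    ["Alice Wong", "Paris", "1990"],
]
print(naive_duplicates(records, threshold=0.9))
```

Because the records are matched attribute by attribute rather than by position, the first two records are treated as duplicates even though their schemas order the attributes differently. The quadratic number of comparisons in `naive_duplicates` is the efficiency shortcoming the abstract refers to; in practice a polynomial-time assignment algorithm would replace the brute-force search.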

Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), the National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), the National Grant of High Technology 863 Program of China (No. 2009AA01Z149), the Key Program of the National Natural Science Foundation of China (No. 60933001), the National Postdoctoral Foundation of China (No. 20090450126), and the Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).



Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, 150001
Mohan Li, Hongzhi Wang, Jianzhong Li & Hong Gao

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
Computer Department, Sichuan University, 610064, Chengdu, China
Changjie Tang
Department of Computer Science, Duke University, Box 90129, NC 27708-0129, Durham, USA
Jun Yang
College of Computer Science, Zhejiang University, 388 Yuhangtang Road, 310058, Hangzhou, China
Yunjun Gao

About this paper

Cite this paper

Li, M., Wang, H., Li, J., Gao, H. (2010). Efficient Duplicate Record Detection Based on Similarity Estimation. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_58

DOI: https://doi.org/10.1007/978-3-642-14246-8_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science, Computer Science (R0)