Abstract
In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of the resulting matching. However, this intuitive method has two shortcomings. In terms of efficiency, it must compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to determine from that estimate whether they are duplicates. Theoretical analysis and experimental results show that the method is effective and efficient.
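The matching-based similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute similarity `attr_sim` (built on `difflib.SequenceMatcher`), the normalisation by the larger attribute count, and the function name `record_similarity` are all assumptions made for illustration. The optimal matching is found by brute force over assignments, which is only reasonable for records with a handful of attributes; the Hungarian method would be the usual choice beyond that.

```python
from difflib import SequenceMatcher
from itertools import permutations


def attr_sim(a: str, b: str) -> float:
    """Hypothetical attribute similarity in [0, 1] (not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: list[str], r2: list[str]) -> float:
    """Weight of the optimal bipartite matching between attribute values.

    Normalising by the larger attribute count is an assumption, chosen so
    that the result stays in [0, 1] when the schemas differ in size.
    """
    if not r1 or not r2:
        return 0.0
    # Match every attribute of the smaller record to a distinct attribute
    # of the larger one.
    small, large = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    weights = [[attr_sim(a, b) for b in large] for a in small]
    # Brute-force search over all one-to-one assignments; exponential in the
    # number of attributes, so suitable only for small records.
    best = max(
        sum(weights[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(large)
```

Because the matching ignores attribute order, `record_similarity(["John Smith", "Boston"], ["Boston", "John Smith"])` pairs each value with its counterpart even though the two schemas list them differently. Comparing every pair of records this way is the quadratic cost the abstract identifies as the efficiency shortcoming.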
Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Postdoctoral Foundation of China (No. 20090450126), Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, M., Wang, H., Li, J., Gao, H. (2010). Efficient Duplicate Record Detection Based on Similarity Estimation. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_58
DOI: https://doi.org/10.1007/978-3-642-14246-8_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science, Computer Science (R0)