Abstract
The problem of identifying approximately duplicate records between databases is known, among other names, as duplicate detection or record linkage. It is typically addressed either with rules or with a weighted aggregation of distances between the individual attributes of potential duplicates. However, choosing appropriate rules, distance functions, weights, and thresholds requires either a deep understanding of the application domain or, for supervised learning approaches, a representative training set. In this paper we present an unsupervised, domain-independent approach that starts with a broad alignment of potential duplicates and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach outperforms other unsupervised approaches and reaches almost the same accuracy as fully supervised, domain-dependent approaches.
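The iterative refinement described in the abstract can be illustrated with a small sketch. It is not the algorithm from the paper: the function names, the two-bin histograms, the add-one smoothing, the mean-distance starting rule, and the naive independence assumption between attributes are all choices made here for illustration only. The input is assumed to be a precomputed vector of per-attribute distances in [0, 1] for every candidate record pair.

```python
import math

def bin_index(d, bins):
    """Map a distance in [0, 1] to a histogram bin."""
    return min(int(d * bins), bins - 1)

def histogram(pairs, dists, attr, bins):
    """Smoothed relative frequencies of one attribute's distances."""
    counts = [1.0] * bins  # add-one smoothing avoids zero probabilities
    for p in pairs:
        counts[bin_index(dists[p][attr], bins)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def refine_alignment(dists, initial_threshold=0.6, bins=2, max_iter=20):
    """Iteratively refine a broad initial set of potential duplicates.

    dists maps a record pair to its list of per-attribute distances.
    Hypothetical sketch: assumes attributes are independent given the class.
    """
    if not dists:
        return set()
    n_attr = len(next(iter(dists.values())))
    # Broad initial alignment: any pair whose mean distance is below a
    # deliberately generous threshold counts as a potential duplicate.
    dup = {p for p, d in dists.items() if sum(d) / len(d) < initial_threshold}
    for _ in range(max_iter):
        non = set(dists) - dup
        if not dup or not non:
            break  # one class is empty, nothing to compare against
        # Estimate the distance distributions observed in each class.
        h_dup = [histogram(dup, dists, k, bins) for k in range(n_attr)]
        h_non = [histogram(non, dists, k, bins) for k in range(n_attr)]
        log_prior_odds = math.log(len(dup) / len(non))
        # Reclassify every pair by its log-odds of being a duplicate.
        new_dup = set()
        for p, d in dists.items():
            score = log_prior_odds
            for k, v in enumerate(d):
                b = bin_index(v, bins)
                score += math.log(h_dup[k][b] / h_non[k][b])
            if score > 0:
                new_dup.add(p)
        if new_dup == dup:
            break  # alignment is stable
        dup = new_dup
    return dup
```

Calling `refine_alignment` on a dictionary such as `{(0, 1): [0.0, 0.1], (0, 2): [0.2, 0.9]}` (extended with the remaining pairs) returns the set of pairs still classified as duplicates once the alignment stops changing. No weights or thresholds beyond the generous starting threshold are tuned by hand, which is the property the abstract emphasises.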
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lehti, P., Fankhauser, P. (2005). Probabilistic Iterative Duplicate Detection. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE. OTM 2005. Lecture Notes in Computer Science, vol 3761. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575801_19
DOI: https://doi.org/10.1007/11575801_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29738-3
Online ISBN: 978-3-540-32120-0
eBook Packages: Computer Science, Computer Science (R0)