Abstract
The importance of identifying records in databases that refer to the same real-world entity (“duplicate detection”) has been recognized in both research and practice. However, existing supervised approaches for duplicate detection need training data with labeled instances of duplicates and non-duplicates, which is often costly and time-consuming to generate. In contrast, unsupervised approaches can forgo such training data but may rely on limiting assumptions (e.g., monotonicity) and provide less reliable results. To generate high-quality results using only easy-to-acquire duplicate-free training data, we propose a probabilistic approach for anomaly-based duplicate detection. Duplicates exhibit specific characteristics which differ significantly from the characteristics of non-duplicates and therefore represent anomalies. Based on the grade of anomaly compared to duplicate-free training data, our approach assigns each analyzed pair of records a probability of being a duplicate while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analyzing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform even fully supervised state-of-the-art approaches for duplicate detection.
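The general idea of scoring a record pair by how anomalous it looks relative to duplicate-free training data can be illustrated with a minimal Python sketch. This is not the method proposed in the paper: the field-wise similarity (the standard library's difflib), the empirical-distribution score, the function names, and the toy records are all assumptions chosen for illustration, and the resulting score is an anomaly grade between 0 and 1, not a calibrated duplicate probability.

```python
from bisect import bisect_left
from difflib import SequenceMatcher
from itertools import combinations


def pair_similarity(a, b):
    """Mean string similarity over the fields of two records (tuples of str)."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def fit_reference(records):
    """Sorted similarities of all record pairs in a duplicate-free training set."""
    return sorted(pair_similarity(a, b) for a, b in combinations(records, 2))


def anomaly_grade(reference, a, b):
    """Fraction of training pairs that are strictly less similar than (a, b)."""
    s = pair_similarity(a, b)
    return bisect_left(reference, s) / len(reference)


# Hypothetical duplicate-free training records (invented for illustration).
training = [
    ("Anna Schmidt", "Berlin"),
    ("Jonas Weber", "Hamburg"),
    ("Maria Fischer", "Munich"),
    ("Peter Wagner", "Cologne"),
]
reference = fit_reference(training)

# A near-identical pair is more similar than any pair seen in training.
likely_duplicate = anomaly_grade(
    reference, ("Anna Schmidt", "Berlin"), ("Anna Schmitt", "Berlin")
)
# A pair of clearly different records falls inside the training distribution.
unrelated = anomaly_grade(
    reference, ("Anna Schmidt", "Berlin"), ("Peter Wagner", "Cologne")
)
print(likely_duplicate, unrelated)
```

In this sketch, a pair whose similarity exceeds everything observed among non-duplicates receives the highest grade. Turning such a grade into a probability of being a duplicate, as the abstract describes, requires the probabilistic model developed in the paper itself.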
Notes
1. Mathematical pseudocode for both methods of instantiation is available at: https://github.com/aoberm/Anomaly-Based-Duplicate-Detection.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Obermeier, A. (2019). Anomaly-Based Duplicate Detection: A Probabilistic Approach. In: Tulu, B., Djamasbi, S., Leroy, G. (eds) Extending the Boundaries of Design Science Theory and Practice. DESRIST 2019. Lecture Notes in Computer Science, vol 11491. Springer, Cham. https://doi.org/10.1007/978-3-030-19504-5_15
DOI: https://doi.org/10.1007/978-3-030-19504-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19503-8
Online ISBN: 978-3-030-19504-5
eBook Packages: Computer Science, Computer Science (R0)