Anomaly-Based Duplicate Detection: A Probabilistic Approach

Obermeier, Andreas

doi:10.1007/978-3-030-19504-5_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11491)

Included in the following conference series:

International Conference on Design Science Research in Information Systems and Technology

Abstract

The importance of identifying records in databases that refer to the same real-world entity (“duplicate detection”) has been recognized in both research and practice. However, existing supervised approaches for duplicate detection need training data with labeled instances of duplicates and non-duplicates, which is often costly and time-consuming to generate. In contrast, unsupervised approaches can forgo such training data but may suffer from limiting assumptions (e.g., monotonicity) and provide less reliable results. To generate high-quality results using only easy-to-acquire duplicate-free training data, we propose a probabilistic approach for anomaly-based duplicate detection. Duplicates exhibit specific characteristics that differ significantly from those of non-duplicates and therefore represent anomalies. Based on the degree of anomaly relative to the duplicate-free training data, our approach assigns a probability of being a duplicate to each analyzed pair of records while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analyzing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform even fully supervised state-of-the-art approaches for duplicate detection.
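The idea in the abstract, scoring each record pair by how anomalous it looks relative to duplicate-free training data, can be illustrated with a minimal sketch. This is not the chapter's actual instantiation (which is not reproduced on this page); it is an assumption-laden toy version. The function names (`similarity`, `bin_of`, `bin_shares`, `duplicate_probability`), the histogram binning, the `difflib` string comparator, and the simplification that duplicates are rare enough for the excess share `1 - expected / observed` to approximate the duplicate probability are all hypothetical choices made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1]; difflib is a stand-in for any comparator."""
    return SequenceMatcher(None, a, b).ratio()


def bin_of(value, n_bins=5):
    """Map a similarity in [0, 1] to a histogram bin index."""
    return min(int(value * n_bins), n_bins - 1)


def bin_shares(records, n_bins=5):
    """Share of all record pairs that fall into each similarity bin."""
    counts = Counter(
        bin_of(similarity(a, b), n_bins) for a, b in combinations(records, 2)
    )
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def duplicate_probability(pair, train_shares, data_shares, n_bins=5):
    """Excess share of the pair's bin relative to duplicate-free data.

    Assumption: duplicates are rare, so the share expected from
    non-duplicates is approximated by the training share of the bin.
    """
    b = bin_of(similarity(*pair), n_bins)
    expected = train_shares.get(b, 0.0)  # share under "no duplicates"
    observed = data_shares.get(b, 0.0)   # share in the analyzed data
    if observed == 0.0:
        return 0.0
    return max(0.0, 1.0 - expected / observed)


# Hypothetical toy data: duplicate-free training records and an analyzed
# data set that additionally contains a misspelled copy of one record.
train = ["anna schmidt", "peter maier", "lukas weber", "maria huber"]
data = train + ["anna schmitt"]

train_shares = bin_shares(train)
data_shares = bin_shares(data)

p_dup = duplicate_probability(("anna schmidt", "anna schmitt"), train_shares, data_shares)
p_non = duplicate_probability(("anna schmidt", "peter maier"), train_shares, data_shares)
print(p_dup, p_non)
```

In this toy run the near-identical pair falls into a similarity bin that never occurs in the duplicate-free training data, so it receives the maximum score, while the clearly distinct pair scores lower. With so few records the bin shares are noisy; a realistic setting would use several attributes, many more pairs, and a smoother density estimate.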

Notes

1. Mathematical pseudocode for both methods of instantiation is available at: https://github.com/aoberm/Anomaly-Based-Duplicate-Detection.

Author information

Authors and Affiliations

University of Ulm, 89069, Ulm, Germany
Andreas Obermeier

Corresponding author

Correspondence to Andreas Obermeier.

Editor information

Editors and Affiliations

Foisie Business School, Worcester Polytechnic Institute, Worcester, MA, USA
Bengisu Tulu
Foisie Business School, Worcester Polytechnic Institute, Worcester, MA, USA
Soussan Djamasbi
Eller College of Management, University of Arizona, Tucson, AZ, USA
Gondy Leroy

About this paper

Cite this paper

Obermeier, A. (2019). Anomaly-Based Duplicate Detection: A Probabilistic Approach. In: Tulu, B., Djamasbi, S., Leroy, G. (eds) Extending the Boundaries of Design Science Theory and Practice. DESRIST 2019. Lecture Notes in Computer Science, vol 11491. Springer, Cham. https://doi.org/10.1007/978-3-030-19504-5_15

DOI: https://doi.org/10.1007/978-3-030-19504-5_15
Published: 27 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19503-8
Online ISBN: 978-3-030-19504-5
eBook Packages: Computer Science, Computer Science (R0)
