Anomaly-Based Duplicate Detection: A Probabilistic Approach

  • Conference paper
Extending the Boundaries of Design Science Theory and Practice (DESRIST 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11491)

Abstract

The importance of identifying records in databases that refer to the same real-world entity (“duplicate detection”) has been recognized in both research and practice. However, existing supervised approaches for duplicate detection need training data with labeled instances of duplicates and non-duplicates, which is often costly and time-consuming to generate. In contrast, unsupervised approaches can forgo such training data but may rely on limiting assumptions (e.g., monotonicity) and provide less reliable results. To generate high-quality results using only easy-to-acquire, duplicate-free training data, we propose a probabilistic approach for anomaly-based duplicate detection. Duplicates exhibit specific characteristics that differ significantly from those of non-duplicates and therefore represent anomalies. Based on the degree of anomaly relative to the duplicate-free training data, our approach assigns a probability of being a duplicate to each analyzed pair of records while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analyzing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform even fully supervised state-of-the-art approaches for duplicate detection.
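
The general idea of scoring a record pair by how unusual it looks relative to duplicate-free data can be illustrated with a minimal sketch. It is not the method of the paper: the string comparator (`difflib.SequenceMatcher`), the empirical-distribution score, the function names, and the sample names are all assumptions made for illustration, and the resulting score is an anomaly grade rather than a calibrated duplicate probability.

```python
from bisect import bisect_left
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any record comparator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_reference(duplicate_free: list[str]) -> list[float]:
    """Sorted similarities of all record pairs in duplicate-free training data."""
    return sorted(similarity(a, b) for a, b in combinations(duplicate_free, 2))


def anomaly_score(reference: list[float], a: str, b: str) -> float:
    """Share of duplicate-free pairs that are strictly less similar than (a, b)."""
    s = similarity(a, b)
    return bisect_left(reference, s) / len(reference)


# Hypothetical duplicate-free training records (illustrative only).
training = ["Anna Schmidt", "Peter Müller", "Jonas Weber", "Lea Fischer", "Max Wagner"]
reference = fit_reference(training)

# A near-identical pair should look far more anomalous than an ordinary pair.
high = anomaly_score(reference, "Peter Müller", "Peter Mueller")
low = anomaly_score(reference, "Peter Müller", "Lea Fischer")
print(high, low)
```

A pair whose similarity exceeds that of almost every duplicate-free pair receives a score close to 1 and would be a candidate duplicate; turning such a grade into a reliable probability requires the modeling steps described in the paper itself.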

Notes

  1. Mathematical pseudocode for both methods of instantiation is available at: https://github.com/aoberm/Anomaly-Based-Duplicate-Detection.

Author information

Corresponding author

Correspondence to Andreas Obermeier.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Obermeier, A. (2019). Anomaly-Based Duplicate Detection: A Probabilistic Approach. In: Tulu, B., Djamasbi, S., Leroy, G. (eds) Extending the Boundaries of Design Science Theory and Practice. DESRIST 2019. Lecture Notes in Computer Science, vol 11491. Springer, Cham. https://doi.org/10.1007/978-3-030-19504-5_15

  • DOI: https://doi.org/10.1007/978-3-030-19504-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-19503-8

  • Online ISBN: 978-3-030-19504-5

  • eBook Packages: Computer Science, Computer Science (R0)
