DOI: 10.1145/3555776.3578724
poster

On evaluating text similarity measures for customer data deduplication

Published: 07 June 2023

Abstract

In this paper, we summarize the results of evaluating 44 similarity measures on text values that represent real institutional customer data. The data come from a project conducted for a large financial institution in Poland. The similarity measures were assessed on the similarity values they returned and on their execution times. To the best of our knowledge, this is the first report that evaluates such a large selection of different similarity measures.
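
The two assessment criteria named in the abstract, returned similarity values and execution time, can be illustrated with a minimal sketch. This is not the authors' evaluation harness: the three measures, the function names, the `evaluate` helper, and the sample record pairs are all assumptions made up for illustration, and only the Python standard library is used.

```python
import timeit
from difflib import SequenceMatcher


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def jaccard_bigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def sequence_matcher_similarity(a: str, b: str) -> float:
    """Similarity ratio computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical selection; the paper itself evaluates 44 measures.
MEASURES = {
    "levenshtein": levenshtein_similarity,
    "jaccard_bigram": jaccard_bigram_similarity,
    "sequence_matcher": sequence_matcher_similarity,
}


def evaluate(pairs, repeats=100):
    """Return {measure name: (similarity scores, total seconds for `repeats` runs)}."""
    results = {}
    for name, measure in MEASURES.items():
        scores = [measure(a, b) for a, b in pairs]
        seconds = timeit.timeit(
            lambda: [measure(a, b) for a, b in pairs], number=repeats
        )
        results[name] = (scores, seconds)
    return results


if __name__ == "__main__":
    # Invented sample values, not taken from the project data.
    sample_pairs = [
        ("Kowalski Jan", "Kowalsky Jan"),
        ("ul. Dluga 5, Poznan", "Dluga 5 Poznan"),
    ]
    for name, (scores, seconds) in evaluate(sample_pairs).items():
        print(name, [round(s, 3) for s in scores], f"{seconds:.4f}s")
```

Reporting both the scores and the timing for each measure allows a trade-off between discriminative behavior and cost to be inspected side by side, which is the kind of comparison the abstract describes.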

Cited By

  • (2024) On Customer Data Deduplication - Research vs. Industrial Perspective. New Trends in Database and Information Systems, 392-400. DOI: 10.1007/978-3-031-70421-5_37. Online publication date: 14-Nov-2024

      Published In

      SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
      March 2023
      1932 pages
      ISBN: 9781450395175
      DOI: 10.1145/3555776
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data quality
      2. entity resolution
      3. data deduplication
      4. text similarity measures

      Qualifiers

      • Poster

      Conference

      SAC '23

      Acceptance Rates

      Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
