DOI: 10.1145/1999299.1999302
research-article

To compare or not to compare: making entity resolution more efficient

Published: 12 June 2011

Abstract

Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous user-generated content of Web 2.0. To deal with such data, attribute-agnostic blocking has recently been introduced, following a two-fold strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency.
In this paper, we present a set of techniques that can be plugged into the second layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons and, building on it, an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to tune comparison pruning a priori. We apply our blocking techniques to two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.
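
To make the notion of redundant comparisons concrete, the following is a minimal sketch, not the method of the paper. It assumes a simple token-based scheme as a stand-in for attribute-agnostic blocking: every token found in any attribute value defines a block, so blocks overlap and the same pair of entities can be suggested several times. The function names (`token_blocks`, `all_comparisons`, `distinct_comparisons`) and the toy entities are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(entities):
    """Place each entity in one block per token found in any of its values.

    Attribute names are ignored on purpose (assumption: this mimics the
    attribute-agnostic setting in the simplest possible way).
    """
    blocks = defaultdict(set)
    for entity_id, values in entities.items():
        for value in values:
            for token in value.lower().split():
                blocks[token].add(entity_id)
    # A block with a single entity suggests no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def all_comparisons(blocks):
    """Every pair suggested by every block, repeats included."""
    return [
        pair
        for ids in blocks.values()
        for pair in combinations(sorted(ids), 2)
    ]


def distinct_comparisons(blocks):
    """The same pairs with repeats removed, so each pair is compared once."""
    return sorted(set(all_comparisons(blocks)))


# Toy data (hypothetical): each entity is just a list of attribute values.
entities = {
    "e1": ["John Smith", "Berlin"],
    "e2": ["Smith John", "Berlin Germany"],
    "e3": ["Jane Doe", "Paris"],
    "e4": ["J. Doe", "Paris France"],
}

blocks = token_blocks(entities)
total = all_comparisons(blocks)
distinct = distinct_comparisons(blocks)
print(len(total), len(distinct))  # prints "5 2"
```

In this toy input the overlapping blocks suggest five comparisons, but only two distinct pairs exist, so three comparisons are redundant. Pruning comparisons that probably involve non-matching entities is a separate, approximate step and is not shown here.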




Published In

SWIM '11: Proceedings of the International Workshop on Semantic Web Information Management
June 2011
61 pages
ISBN: 9781450306515
DOI: 10.1145/1999299

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. attribute-agnostic blocking
2. data cleaning
3. entity resolution


Conference

SIGMOD/PODS '11


