research-article

Entity resolution with iterative blocking

Authors:
Steven Euijong Whang (Stanford University, Stanford, CA, USA)
David Menestrina (Stanford University, Stanford, CA, USA)
Georgia Koutrika (Stanford University, Stanford, CA, USA)
Martin Theobald (Stanford University, Stanford, CA, USA)
Hector Garcia-Molina (Stanford University, Stanford, CA, USA)

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, June 2009, Pages 219–232
https://doi.org/10.1145/1559845.1559870

Published: 29 June 2009

ABSTRACT

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process computes the similarity of every pair of records, which can be very expensive for large datasets. Blocking techniques improve the performance of ER by dividing the records into blocks, possibly in multiple ways, and comparing only records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework in which the ER results of each block are propagated to subsequently processed blocks. Blocks are processed iteratively until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because propagating the ER results of one block to other blocks may generate additional record matches. Iterative blocking may also be more efficient because the work done on one block can save processing time on other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
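The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the record layout (attribute name mapped to a set of values), the helper names `merge`, `share_contact`, and `zip_and_initial`, the match rule, and the choice to place a merged record in every block that held either of its sources are all assumptions made for illustration. The loop re-queues any block touched by a merge and stops when no block contains a matching pair.

```python
from collections import defaultdict, deque
from itertools import combinations


def merge(r, s):
    """Hypothetical merge: union the value sets of every attribute."""
    return {k: r.get(k, frozenset()) | s.get(k, frozenset())
            for k in r.keys() | s.keys()}


def iterative_blocking(records, block_keys, match):
    """Process blocks repeatedly until no block holds a matching pair."""
    recs = dict(enumerate(records))      # record id -> record
    blocks = defaultdict(set)            # block key -> record ids
    member_of = defaultdict(set)         # record id -> block keys
    for rid, rec in recs.items():
        for key in block_keys(rec):
            blocks[key].add(rid)
            member_of[rid].add(key)

    queue = deque(blocks)                # blocks waiting to be processed
    queued = set(blocks)
    next_id = len(recs)
    while queue:
        key = queue.popleft()
        queued.discard(key)
        # Look for one matching pair inside this block.
        pair = next(((a, b)
                     for a, b in combinations(sorted(blocks[key]), 2)
                     if match(recs[a], recs[b])), None)
        if pair is None:
            continue
        a, b = pair
        recs[next_id] = merge(recs.pop(a), recs.pop(b))
        # Assumption: the merged record joins every block that held a or b.
        touched = member_of.pop(a) | member_of.pop(b)
        for k in touched:
            blocks[k] -= {a, b}
            blocks[k].add(next_id)
            if k not in queued:          # revisit blocks affected by the merge
                queue.append(k)
                queued.add(k)
        member_of[next_id] = touched
        next_id += 1
    return list(recs.values())


def share_contact(r, s):
    """Hypothetical match rule: a common phone number or e-mail address."""
    return any(r.get(a, frozenset()) & s.get(a, frozenset())
               for a in ("phone", "email"))


def zip_and_initial(rec):
    """Two hypothetical blocking criteria: zip code and name initial."""
    keys = [("zip", z) for z in rec.get("zip", ())]
    keys += [("init", n[0].upper()) for n in rec.get("name", ())]
    return keys


records = [
    {"name": frozenset({"Doe"}), "zip": frozenset({"94305"}),
     "phone": frozenset({"111"})},
    {"name": frozenset({"Doe"}), "zip": frozenset({"10001"}),
     "phone": frozenset({"111"}), "email": frozenset({"jd@example.org"})},
    {"name": frozenset({"Smith"}), "zip": frozenset({"94305"}),
     "phone": frozenset({"222"}), "email": frozenset({"jd@example.org"})},
]
resolved = iterative_blocking(records, zip_and_initial, share_contact)
print(len(resolved))
```

In this toy input the first and third records share a zip-code block but have no contact detail in common, while the second and third records share an e-mail address but never share a block. Processing each block once in isolation would therefore merge only the first two records. Once that merge is propagated into the zip-code block, the merged record carries the e-mail address and matches the third record, which is the kind of additional match the abstract describes.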


Index Terms

Entity resolution with iterative blocking
1. General and reference
  1. Cross-computing tools and techniques
    1. Metrics
2. Information systems
  1. Data management systems
  2. Information systems applications


Published in
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN: 9781605585512
DOI: 10.1145/1559845
Editors: Carsten Binnig, Benoit Dageville
General Chairs: Uğur Çetintemel (Brown University, USA), Stan Zdonik (Brown University, USA)
Program Chair: Donald Kossmann (ETH Zurich, Switzerland)
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
blocking
entity resolution
iterative blocking
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%