ABSTRACT
Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process computes the similarity of every pair of records, which can be very expensive for large datasets. Blocking techniques improve the performance of ER by dividing the records into blocks, possibly in multiple ways, and comparing only records within the same block. However, most blocking techniques process blocks separately and do not exploit the ER results of other blocks. In this paper, we propose an iterative blocking framework in which the ER results of each block are propagated to subsequently processed blocks. Blocks are processed iteratively until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because propagating the ER results of one block to other blocks may generate additional record matches. Iterative blocking may also be more efficient because the work done in one block can save processing time in other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than simple blocking on large datasets.
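The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the system implemented in the paper: the record representation (a frozenset of attribute/value pairs), the match rule (two shared pairs), the merge rule (set union), and the helper names `match`, `merge`, `by`, `resolve_block`, and `iterative_blocking` are all assumptions chosen for illustration.

```python
from collections import deque
from itertools import combinations

# Assumption: a record is a frozenset of (attribute, value) pairs.

def match(a, b):
    # Hypothetical match rule: records match if they share two or more pairs.
    return len(a & b) >= 2

def merge(a, b):
    # Hypothetical merge rule: keep every value seen in either record.
    return a | b

def by(attr):
    # Blocking criterion: one block per distinct value of `attr`.
    return lambda rec: [(attr, v) for a, v in rec if a == attr]

def resolve_block(records):
    """Match and merge within one block until no pair matches."""
    records = set(records)
    replaced = {}  # original record -> record that now represents it
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(records), 2):
            if match(a, b):
                m = merge(a, b)
                records -= {a, b}
                records.add(m)
                for old, new in list(replaced.items()):
                    if new in (a, b):
                        replaced[old] = m
                replaced[a] = m
                replaced[b] = m
                changed = True
                break  # the set changed, so restart the pair scan
    replaced = {old: new for old, new in replaced.items() if old != new}
    return records, replaced

def iterative_blocking(records, criteria):
    # Build the blocks; a record may fall into several blocks.
    blocks = {}
    for rec in records:
        for criterion in criteria:
            for key in criterion(rec):
                blocks.setdefault(key, set()).add(rec)
    queue = deque(blocks)
    queued = set(blocks)
    while queue:
        key = queue.popleft()
        queued.discard(key)
        blocks[key], replaced = resolve_block(blocks[key])
        # Propagate merges to every other block holding a merged record,
        # and schedule those blocks for (re)processing.
        for other_key, other in blocks.items():
            hit = {old for old in replaced if old in other}
            if other_key == key or not hit:
                continue
            other -= hit
            other |= {replaced[old] for old in hit}
            if other_key not in queued:
                queue.append(other_key)
                queued.add(other_key)
    return set().union(*blocks.values()) if blocks else set()

r1 = frozenset({("name", "J Doe"), ("zip", "94305"), ("phone", "555-1")})
r2 = frozenset({("name", "J Doe"), ("zip", "94305"), ("email", "jd@x")})
r3 = frozenset({("name", "John D"), ("phone", "555-1"), ("email", "jd@x")})
resolved = iterative_blocking([r1, r2, r3], [by("zip"), by("phone")])
```

In this toy input, r1 and r3 share only a phone number, so processing the phone block in isolation would not merge them. Once r1 and r2 are merged in the zip block and the merged record replaces r1 in the phone block, it shares both a phone and an email with r3, and all three records collapse into one. This is the kind of additional match the abstract attributes to propagating results between blocks.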