DOI: 10.1145/1559845.1559870
research-article

Entity resolution with iterative blocking

Published: 29 June 2009

ABSTRACT

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
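
The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or system: the function names (`iterative_blocking`, `match`, `merge`, `by_phone`, `by_zip`), the "two shared fields" match rule, and the sample records are all assumptions chosen for illustration. Records are re-blocked after every merge, so a merged record produced in one block is visible to blocks processed later, and the passes repeat until none produces a new match.

```python
from collections import defaultdict
from itertools import combinations

def merge(r1, r2):
    """Union the value sets of two records, field by field."""
    return {f: r1.get(f, frozenset()) | r2.get(f, frozenset())
            for f in r1.keys() | r2.keys()}

def match(r1, r2):
    """Toy rule (an assumption): records match if two or more fields share a value."""
    shared = sum(1 for f in r1.keys() & r2.keys() if r1[f] & r2[f])
    return shared >= 2

def by_phone(rec):
    """Blocking criterion: one block per phone value."""
    return rec.get("phone", frozenset())

def by_zip(rec):
    """Blocking criterion: one block per zip value."""
    return rec.get("zip", frozenset())

def iterative_blocking(records, key_fns):
    """Return sorted groups of input indices judged to be the same entity."""
    # Each cluster holds a merged record whose fields map to sets of values.
    clusters = {i: {f: frozenset([v]) for f, v in r.items()}
                for i, r in enumerate(records)}
    members = {i: {i} for i in clusters}
    changed = True
    while changed:  # repeat until a full pass finds no new match
        changed = False
        for key_fn in key_fns:
            # Re-block the current (possibly merged) records.
            blocks = defaultdict(list)
            for cid, rec in clusters.items():
                for key in key_fn(rec):
                    blocks[key].append(cid)
            for ids in blocks.values():
                for a, b in combinations(ids, 2):
                    # Skip ids already absorbed by an earlier merge.
                    if a in clusters and b in clusters and match(clusters[a], clusters[b]):
                        clusters[a] = merge(clusters[a], clusters[b])
                        members[a] |= members.pop(b)
                        del clusters[b]
                        changed = True
    return sorted(sorted(m) for m in members.values())

records = [
    {"name": "Ann Smith", "phone": "555-0100", "email": "ann@x.org"},
    {"name": "Ann Smith", "phone": "555-0100", "zip": "94305"},
    {"name": "A. Smith", "email": "ann@x.org", "zip": "94305"},
]
print(iterative_blocking(records, [by_phone, by_zip]))
```

In this sample, the third record shares only one field with each of the first two, so it matches neither on its own. Once the phone block merges the first two records, the merged record carries both the email and the zip code, and the zip block then matches it against the third record. Processing the blocks independently on the original records would miss that match.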

          Published in

          SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
          June 2009
          1168 pages
          ISBN: 9781605585512
          DOI: 10.1145/1559845

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 785 of 4,003 submissions, 20%
