DOI: 10.1145/3105831.3105844
research-article

Efficient Density-Based Blocking for Record Matching

Published: 12 July 2017

ABSTRACT

Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. In practice, the main challenge of record matching is that the number of non-matches typically far exceeds the number of matches. This is called the imbalance problem, and it notoriously affects both the efficiency and the effectiveness of matching algorithms. To address the imbalance problem, density-based blocking algorithms have recently been studied and have demonstrated effective blocking performance. However, the efficiency of density-based blocking approaches is not as good as their effectiveness. In this paper, we improve the efficiency of density-based blocking by exploiting the ideas of pre-computing and pruning. Our approach optimizes the method of computing density to speed up the blocking process. In experiments on real-world datasets, the proposed approach demonstrated high performance in both blocking efficiency and blocking effectiveness.


      • Published in

        IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium
        July 2017
        338 pages
        ISBN: 9781450352208
        DOI: 10.1145/3105831

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        IDEAS '17 Paper Acceptance Rate: 38 of 102 submissions, 37%
        Overall Acceptance Rate: 74 of 210 submissions, 35%