ABSTRACT
Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. In practice, the main challenge of record matching is that the number of non-matches typically far exceeds the number of matches. This is called the imbalance problem, and it notoriously degrades both the efficiency and the effectiveness of matching algorithms. To address the imbalance problem, density-based blocking algorithms have recently been studied and have demonstrated effective blocking performance. However, the efficiency of density-based blocking approaches is not as good as their effectiveness. In this paper, we improve the efficiency of density-based blocking by exploiting the ideas of pre-computing and pruning. Our approach optimizes the density computation to speed up the blocking process. In experiments on real-world datasets, the proposed approach achieved high performance in both blocking efficiency and blocking effectiveness.
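To give a sense of the scale of the imbalance problem, the following minimal sketch counts non-matching pairs per matching pair when every record of one source is compared with every record of another. The function name `imbalance_ratio` and the dataset sizes are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def imbalance_ratio(n_left: int, n_right: int, n_matches: int) -> float:
    """Return the number of non-matching pairs per matching pair.

    Assumes exhaustive pairwise comparison between two sources, so the
    candidate space contains n_left * n_right record pairs.
    """
    total_pairs = n_left * n_right
    if not 0 < n_matches <= total_pairs:
        raise ValueError("n_matches must be between 1 and n_left * n_right")
    return (total_pairs - n_matches) / n_matches


# Hypothetical example: two sources of 1,000 records each, where every
# record has exactly one true match in the other source.
ratio = imbalance_ratio(1_000, 1_000, 1_000)
print(ratio)  # 999.0 non-matches for every match
```

Under these assumed sizes, a matcher has to examine 999 non-matching pairs for each true match. Blocking aims to discard most of those non-matches before the more expensive pairwise comparison takes place.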