Efficient Density-Based Blocking for Record Matching

Authors:
Chenxiao Dou (University of New South Wales, Sydney, NSW, Australia)
Ruoyu Wang (Shanghai Jiao Tong University, Shanghai, China)
Daniel Sun (CSIRO, University of New South Wales, Shanghai Jiao Tong University, ACT, Australia)
Muhammad Atif (National Computational Infrastructure, ACT, Australia)

IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium, July 2017, Pages 118–126. https://doi.org/10.1145/3105831.3105844

Published: 12 July 2017

ABSTRACT

Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. In practice, the main challenge of record matching is that the number of non-matches typically far exceeds the number of matches. This is called the imbalance problem, and it notoriously degrades both the efficiency and the effectiveness of matching algorithms. To address the imbalance problem, density-based blocking algorithms have recently been studied and shown to block effectively. However, the efficiency of density-based blocking approaches is not as good as their effectiveness. In this paper, we improve the efficiency of density-based blocking by exploiting pre-computation and pruning. Our approach optimizes the density computation to speed up the blocking process. In experiments on real-world datasets, the proposed approach performed well on both blocking efficiency and blocking effectiveness.
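
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of pre-computing and pruning a density computation, not the authors' method. It assumes that each candidate record pair is represented as a similarity vector, that the density of a vector is the number of vectors (itself included) within a radius `eps`, and that vectors in dense regions are the likely non-matches to block. The name `dense_flags` and the parameters `eps` and `min_pts` are hypothetical. The pre-computed structure is a uniform grid whose cells have diagonal `eps`; the pruning rule is that a cell already holding `min_pts` vectors makes all of its members dense without any distance computation.

```python
import math
from collections import defaultdict
from itertools import product


def dense_flags(vectors, eps, min_pts):
    """Flag each vector that has at least min_pts vectors (itself included) within eps."""
    dim = len(vectors[0])
    side = eps / math.sqrt(dim)  # cell diagonal equals eps

    # Pre-computing: bucket every vector into its grid cell once.
    grid = defaultdict(list)
    for i, v in enumerate(vectors):
        grid[tuple(int(x // side) for x in v)].append(i)

    reach = math.ceil(eps / side)  # number of cells eps can span along one axis
    flags = [False] * len(vectors)

    for cell, members in grid.items():
        # Pruning: vectors sharing a cell are closer than eps to each other,
        # so a full cell settles the answer for all of its members.
        if len(members) >= min_pts:
            for i in members:
                flags[i] = True
            continue

        # Otherwise count exactly, scanning only the cells eps can reach.
        for i in members:
            count = 0
            for offset in product(range(-reach, reach + 1), repeat=dim):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                for j in grid.get(neighbour, ()):
                    if math.dist(vectors[i], vectors[j]) <= eps:
                        count += 1
            flags[i] = count >= min_pts
    return flags


# Hypothetical similarity vectors: four low-similarity pairs and one high-similarity pair.
pairs = [(0.05, 0.05), (0.06, 0.04), (0.04, 0.06), (0.05, 0.06), (0.9, 0.95)]
blocked = dense_flags(pairs, eps=0.1, min_pts=3)
# blocked == [True, True, True, True, False]
```

In this toy input the four clustered vectors are flagged through the pruning rule alone, and only the isolated vector needs an exact neighbour count. Whether the paper uses a grid, a different index, or a different pruning bound cannot be determined from the abstract.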

Index Terms

1. Information systems
  1. Data management systems
  2. Information systems applications

Published in
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium
July 2017
338 pages
ISBN: 9781450352208
DOI: 10.1145/3105831
Editors: Bipin C. Desai (Concordia University), Jun Hong (UWE, Bristol), Richard McClatchey (UWE, Bristol)
General Chair: Bipin C. Desai (Concordia University)
Program Chair: Jun Hong (UWE, Bristol)
Copyright © 2017 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
Density
Record Matching
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
IDEAS '17 Paper Acceptance Rate: 38 of 102 submissions, 37%
Overall Acceptance Rate: 74 of 210 submissions, 35%