A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage

Ranbaduge, Thilina; Vatsalan, Dinusha; Christen, Peter

doi:10.1007/978-3-319-93040-4_2

Conference paper
First Online: 17 June 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Record linkage is a common data integration task that identifies matching records, that is, records from different databases that refer to the same entity. The scalability of multidatabase record linkage (MDRL) is significantly challenged by increases in both the size and the number of the databases to be linked. Identifying matching records across subgroups of databases is an important aspect of MDRL that has not been addressed so far. We propose a scalable subgroup blocking approach for MDRL that uses an efficient search over a graph structure to identify similar blocks of records that need to be compared across subgroups of multiple databases. We analyse our technique in terms of complexity and blocking quality. An empirical study on large real-world datasets shows that our approach scales with the size of subgroups and the number of databases, and that it outperforms an existing state-of-the-art blocking technique for MDRL.
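
The general idea of linking blocks across subgroups of databases can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in the abstract. As assumptions for illustration only, the sketch uses the surname as the blocking key, Jaccard similarity over character bigrams to compare block keys, and a plain connected-components search over the resulting block graph. The names `build_blocks`, `subgroup_candidates`, `threshold` and `min_dbs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bigrams(s):
    """Return the set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two block keys."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0


def build_blocks(databases, key_fn):
    """Group the record ids of each database by blocking key.

    Returns a mapping {(db_name, key): [record ids]}.
    """
    blocks = defaultdict(list)
    for db_name, records in databases.items():
        for rec_id, rec in records.items():
            blocks[(db_name, key_fn(rec))].append(rec_id)
    return blocks


def subgroup_candidates(blocks, threshold=0.3, min_dbs=2):
    """Connect similar blocks from different databases and return the
    connected components that span at least min_dbs databases."""
    nodes = list(blocks)
    adj = defaultdict(set)
    # Assumed similarity test: an edge links two blocks from different
    # databases whose keys have Jaccard similarity >= threshold.
    for u, v in combinations(nodes, 2):
        if u[0] != v[0] and jaccard(u[1], v[1]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Keep only components covering a large enough subgroup.
        if len({db for db, _ in comp}) >= min_dbs:
            groups.append(sorted(comp))
    return groups


def surname_key(rec):
    """Hypothetical blocking key: the surname as-is."""
    return rec["surname"]


# Toy data, invented for this sketch.
databases = {
    "A": {"a1": {"surname": "smith"}, "a2": {"surname": "jones"}},
    "B": {"b1": {"surname": "smyth"}, "b2": {"surname": "brown"}},
    "C": {"c1": {"surname": "smith"}, "c2": {"surname": "jonas"}},
}

blocks = build_blocks(databases, surname_key)
pairs_or_more = subgroup_candidates(blocks, threshold=0.3, min_dbs=2)
all_three = subgroup_candidates(blocks, threshold=0.3, min_dbs=3)
```

With this toy data, `pairs_or_more` holds two groups of blocks (the smith/smyth blocks from A, B and C, and the jones/jonas blocks from A and C), while `all_three` keeps only the group that spans all three databases. The pairwise comparison of all blocks is quadratic and is used here only for brevity; it says nothing about the efficiency of the approach described in the paper.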

This work was funded by the Australian Research Council under Discovery Projects DP130101801 and DP160101934. The authors would also like to thank Vassilios Verykios for his valuable feedback.

Author information

Authors and Affiliations

Research School of Computer Science, The Australian National University, Canberra, ACT, Australia
Thilina Ranbaduge, Dinusha Vatsalan & Peter Christen
Data61, CSIRO, Eveleigh, NSW, Australia
Dinusha Vatsalan

Corresponding author

Correspondence to Thilina Ranbaduge.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

Cite this paper

Ranbaduge, T., Vatsalan, D., Christen, P. (2018). A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_2

DOI: https://doi.org/10.1007/978-3-319-93040-4_2
Published: 17 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)
