Abstract
Record linkage is the process of integrating information about the same underlying entity across disparate data sets. This process is increasingly used to build accurate representations of individuals and organizations for applications ranging from creditworthiness assessments to continuity of medical care. It can be computationally intensive because it requires comparing large quantities of records over a range of attributes. In big data settings, blocking methods, which are designed to limit the number of record pair comparisons that need to be performed, are critical for scaling up the record linkage process. These methods group potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely affect the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this issue, we perform a detailed, systematic empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on a single type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.
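To make the effect of a missing value on blocking concrete, the following is a minimal Python sketch of key-based blocking. It is an illustration under stated assumptions, not the method evaluated in the paper: the toy records, the attribute names `last_name` and `zip_code`, the helper names `block` and `candidate_pairs`, and the policy of skipping records whose blocking key is missing are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, last_name, zip_code).
# None marks a missing value.
records = [
    ("a1", "smith", "37212"),
    ("b1", "smith", "37212"),  # assumed true match of a1
    ("a2", "jones", "75080"),
    ("b2", "jones", None),     # assumed true match of a2, zip code missing
    ("a3", "brown", "27514"),
]

FIELDS = ("last_name", "zip_code")


def block(records, key_fields):
    """Group record ids by the values of the chosen blocking attributes.

    Records with a missing value in any blocking attribute are skipped.
    This is one simple assumed policy, not the only possible one.
    """
    blocks = defaultdict(list)
    for rec_id, *attrs in records:
        named = dict(zip(FIELDS, attrs))
        key = tuple(named[f] for f in key_fields)
        if any(value is None for value in key):
            continue
        blocks[key].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Return every pair of record ids that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Blocking on a single attribute loses the (a2, b2) pair,
# because b2 has no zip code and never enters a block.
zip_pairs = candidate_pairs(block(records, ["zip_code"]))

# Blocking on a second attribute and taking the union recovers it.
name_pairs = candidate_pairs(block(records, ["last_name"]))
union_pairs = zip_pairs | name_pairs

print(sorted(zip_pairs))
print(sorted(union_pairs))
```

In this toy setting, blocking on `zip_code` alone never proposes the pair with the missing zip code, while the union of two single-attribute blockings does, at the cost of running two passes. This mirrors, in a very simplified way, the abstract's observation that approaches not tied to a single blocking attribute are more robust to missing values.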
Acknowledgements
The research reported herein was supported in part by NIH awards 1R01HG006844 and RM1HG009034, NSF awards CICI-1547324, IIS-1633331, CNS-1837627, and OAC-1828467, and ARO award W911NF-17-1-0356.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Anindya, I.C., Kantarcioglu, M., Malin, B. (2019). Determining the Impact of Missing Values on Blocking in Record Linkage. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_21
DOI: https://doi.org/10.1007/978-3-030-16142-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16141-5
Online ISBN: 978-3-030-16142-2
eBook Packages: Computer Science, Computer Science (R0)