Determining the Impact of Missing Values on Blocking in Record Linkage

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11441)

Abstract

Record linkage is the process of integrating information about the same underlying entity across disparate data sets. This process is increasingly used to build accurate representations of individuals and organizations for applications ranging from creditworthiness assessment to continuity of medical care. It can be computationally intensive because it requires comparing large numbers of records over a range of attributes. In big data settings, blocking methods, which limit the number of record pair comparisons that need to be performed, are critical for scaling up record linkage. These methods group potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely affect the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this issue, we perform a detailed empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on one type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.
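To make the failure mode described in the abstract concrete, the following is a minimal sketch of key-based blocking on toy data. It is not the paper's experimental setup: the records, the attribute names (`surname`, `zip`, `birth_year`), and the helper functions `block_pairs`, `surname_key`, and `zip_year_key` are all hypothetical and chosen only for illustration. It assumes that a record whose blocking key cannot be computed is simply left out of every block.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (hypothetical); None marks a missing value.
# Records 1 and 2 are meant to describe the same person.
records = {
    1: {"surname": "Smith", "zip": "75080", "birth_year": "1980"},
    2: {"surname": None,    "zip": "75080", "birth_year": "1980"},
    3: {"surname": "Jones", "zip": "37203", "birth_year": "1975"},
    4: {"surname": "Smyth", "zip": "37203", "birth_year": None},
}

def block_pairs(records, key_fn):
    """Group records by blocking key and return the candidate pairs."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        key = key_fn(rec)
        if key is not None:  # a record without a key lands in no block
            blocks[key].append(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def surname_key(rec):
    """Blocking key: first two letters of the surname, if present."""
    return rec["surname"][:2].lower() if rec["surname"] else None

def zip_year_key(rec):
    """Blocking key: (zip, birth_year), if both are present."""
    if rec["zip"] and rec["birth_year"]:
        return (rec["zip"], rec["birth_year"])
    return None

# One pass on a single attribute: record 2 has no surname, so the
# true match (1, 2) never becomes a candidate pair.
single_pass = block_pairs(records, surname_key)

# Union of two passes over different attributes: the second pass
# places records 1 and 2 in the same block.
multi_pass = single_pass | block_pairs(records, zip_year_key)

print(sorted(single_pass))
print(sorted(multi_pass))
```

In this toy case the single-attribute pass misses the pair (1, 2) because one record lacks the blocking attribute, while the union of passes over different attributes recovers it. This mirrors, on a very small scale, the abstract's observation about approaches that do not depend on one type of blocking attribute. It says nothing about how large the effect is on real data.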

Acknowledgements

The research reported herein was supported in part by NIH awards 1R01HG006844 and RM1HG009034, NSF awards CICI-1547324, IIS-1633331, CNS-1837627, and OAC-1828467, and ARO award W911NF-17-1-0356.

Author information

Corresponding author

Correspondence to Imrul Chowdhury Anindya.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Anindya, I.C., Kantarcioglu, M., Malin, B. (2019). Determining the Impact of Missing Values on Blocking in Record Linkage. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_21

  • DOI: https://doi.org/10.1007/978-3-030-16142-2_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16141-5

  • Online ISBN: 978-3-030-16142-2

  • eBook Packages: Computer Science, Computer Science (R0)
