Abstract
Health data sources across healthcare service delivery are notoriously disconnected, which hampers effective use of the data. Hospitals, family doctors, pharmacists, and health insurers all hold their own data, even though these data may describe the same patients. Industries offering healthcare and wellness services likewise host and maintain their own data repositories about the patients who use their services. Lastly, governmental organizations collect register and survey data on public health, healthcare utilization, and health outcomes. Linking health data on individuals, events, and locations at various aggregation levels from different sources can be extremely insightful: more information can be drawn from linked data than from each data source separately. Bringing together patient data for which unique personal identifiers exist, such as social security numbers, is rather straightforward. In practice, however, such identifiers are often lacking, so one has to resort to variables that are not necessarily unique to a person, which makes the task of linking data far more challenging. To make things worse, these linking variables contain errors due to misspellings, coding differences, or transcription mistakes. Nevertheless, data linkage must be highly accurate, because connecting the wrong patient records (false links) or missing valid connections between patient records (missed links) can bias analyses of the linked datasets. This chapter surveys the state of the art in data linkage technology within healthcare. It gives (1) an overview of the various methods of data linkage, including deterministic and probabilistic approaches, and (2) a synthesis of healthcare use cases in which data linkage is essential, together with a discussion of the legal and privacy challenges of using data linkage in healthcare.
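The contrast between deterministic and probabilistic linkage can be illustrated with a minimal Python sketch. It is not taken from the chapter: the records, the field names, the similarity threshold, and the m- and u-probabilities are all hypothetical, and the scoring assumes one common probabilistic formulation (Fellegi-Sunter-style log-likelihood match weights), which the abstract itself does not specify.

```python
import math
from difflib import SequenceMatcher

# Hypothetical records without a shared unique identifier.
# The surname in the pharmacy record is misspelled.
hospital = {"first": "Anna", "last": "Jansen", "dob": "1980-04-12", "zip": "5611"}
pharmacy = {"first": "Anna", "last": "Janssen", "dob": "1980-04-12", "zip": "5611"}
other = {"first": "Anne", "last": "Jacobs", "dob": "1975-11-02", "zip": "5611"}

FIELDS = ("first", "last", "dob", "zip")

# Assumed, illustrative probabilities per linking variable:
# m = P(field agrees | same person), u = P(field agrees | different persons).
M_U = {
    "first": (0.95, 0.05),
    "last": (0.95, 0.01),
    "dob": (0.98, 0.001),
    "zip": (0.90, 0.10),
}


def deterministic_match(a, b):
    """Link two records only if every linking variable agrees exactly."""
    return all(a[f] == b[f] for f in FIELDS)


def agrees(x, y, threshold=0.85):
    """Approximate agreement that tolerates small spelling differences."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= threshold


def match_weight(a, b):
    """Sum log2 agreement/disagreement weights over all linking variables."""
    total = 0.0
    for f in FIELDS:
        m, u = M_U[f]
        if agrees(a[f], b[f]):
            total += math.log2(m / u)              # evidence for a link
        else:
            total += math.log2((1 - m) / (1 - u))  # evidence against a link
    return total


if __name__ == "__main__":
    print("exact match:", deterministic_match(hospital, pharmacy))
    print("weight (likely same person):", round(match_weight(hospital, pharmacy), 2))
    print("weight (likely different):", round(match_weight(hospital, other), 2))
```

In this sketch the exact-match rule misses the hospital and pharmacy pair because of a single misspelled surname, whereas the weighted score still ranks that pair well above the unrelated pair. In a real application the m- and u-probabilities would be estimated from data, and the cut-off weights for accepting or rejecting a link would be chosen to balance false links against missed links.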
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kostadinovska, A., Asim, M., Pletea, D., Pauws, S. (2019). Overview of Data Linkage Methods for Integrating Separate Health Data Sources. In: Consoli, S., Reforgiato Recupero, D., Petković, M. (eds) Data Science for Healthcare. Springer, Cham. https://doi.org/10.1007/978-3-030-05249-2_8
DOI: https://doi.org/10.1007/978-3-030-05249-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05248-5
Online ISBN: 978-3-030-05249-2
eBook Packages: Computer Science, Computer Science (R0)