Predicting Missing Values in Medical Data Via XGBoost Regression

Zhang, Xinmeng; Yan, Chao; Gao, Cheng; Malin, Bradley A.; Chen, You

doi:10.1007/s41666-020-00077-1

Research Article
Published: 03 August 2020

Journal of Healthcare Informatics Research, Volume 4, pages 383–394 (2020)

Xinmeng Zhang¹,
Chao Yan¹,
Cheng Gao²,
Bradley A. Malin² &
You Chen²

Abstract

The data in a patient’s laboratory test results are a notable resource for supporting clinical investigation and medical research. However, for a variety of reasons, such data often contain a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Missing values reduce the degree to which the data can be used to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact that no existing approach effectively leverages the contextual information (1) within an individual laboratory test variable or (2) between laboratory test variables. We introduce an approach that combines an unsupervised prefilling strategy with a supervised machine learning method, extreme gradient boosting (XGBoost), to leverage both types of context for imputation. We evaluated the methodology through a series of experiments on approximately 8200 patients’ records in the MIMIC-III dataset. The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model improves imputation by over 20%, on average. Imputation of missing values in temporal variables can be substantially improved by combining a prefilling strategy with supervised training, which leverages the longitudinal and cross-sectional context simultaneously.
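To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the prefilling step is a linear interpolation over time within one patient's series (the abstract does not say which prefilling strategy is used) and that nRMSD divides each error by the range of the true values, which is one common convention. The function names `prefill_linear` and `nrmsd`, and the sample values, are hypothetical. The supervised XGBoost step, which would be trained on the prefilled data, is not shown.

```python
import math


def prefill_linear(times, values):
    """Fill None entries by linear interpolation over time.

    Assumes `times` is sorted ascending with no duplicates. Gaps at the
    start or end of the series copy the nearest observed value.
    This is an illustrative prefilling choice, not the paper's.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(float(v))
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            filled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        elif before:
            filled.append(float(before[-1][1]))  # carry last value forward
        else:
            filled.append(float(after[0][1]))    # carry first value backward
    return filled


def nrmsd(truth, imputed, masked_idx):
    """Root mean square error over the masked positions, with each error
    divided by the range of the true series (assumed normalization)."""
    lo, hi = min(truth), max(truth)
    if hi == lo:
        raise ValueError("true series has zero range")
    squared = [((imputed[i] - truth[i]) / (hi - lo)) ** 2 for i in masked_idx]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: one lab variable measured at four time points (hours).
times = [0, 6, 12, 24]
truth = [1.0, 1.2, 1.6, 2.0]
observed = [1.0, None, 1.6, None]   # two values hidden for evaluation
masked_idx = [1, 3]

prefilled = prefill_linear(times, observed)
score = nrmsd(truth, prefilled, masked_idx)
print(prefilled, score)
```

In a pipeline of the kind the abstract describes, the prefilled matrix would serve as the feature input to one regressor per laboratory variable, and a score such as `score` would be compared before and after the supervised step.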

Data Availability

The dataset is sampled from the Medical Information Mart for Intensive Care III (MIMIC-III) (https://mimic.physionet.org/).

Notes

Data challenge details at http://www.ieee-ichi.org/challenge.html

Acknowledgments

We thank the 7th IEEE International Conference on Healthcare Informatics (ICHI) Data Analytics Challenge on Missing data Imputation (DACMI) challenge committee for the dataset used in this study.

Funding

This research was supported, in part, by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012854. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Vanderbilt University, Nashville, TN, USA
Xinmeng Zhang & Chao Yan
Vanderbilt University Medical Center, Nashville, TN, USA
Cheng Gao, Bradley A. Malin & You Chen

Corresponding author

Correspondence to Xinmeng Zhang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Code Availability

Source code at https://github.com/yanchao0222/SMILES.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, X., Yan, C., Gao, C. et al. Predicting Missing Values in Medical Data Via XGBoost Regression. J Healthc Inform Res 4, 383–394 (2020). https://doi.org/10.1007/s41666-020-00077-1

Received: 03 September 2019
Revised: 29 May 2020
Accepted: 27 July 2020
Published: 03 August 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s41666-020-00077-1
