Abstract
Laboratory test results are a valuable resource for clinical investigation and medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or to document the results. Missing values limit the degree to which the data can be used to learn efficient and effective predictive models. Various approaches have been developed to impute missing laboratory values, but their performance has been limited. This is due, in part, to the fact that no existing approach effectively leverages the contextual information (1) within individual laboratory test variables or (2) between laboratory test variables. We introduce an approach that combines an unsupervised prefilling strategy with a supervised machine learning method, extreme gradient boosting (XGBoost), to leverage both types of context for imputation. We evaluated the methodology through a series of experiments on the records of approximately 8200 patients in the MIMIC-III dataset. The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model improves imputation by over 20%, on average. Missing data imputation on temporal variables can be substantially improved by combining a prefilling strategy with supervised training, which leverages the longitudinal and cross-sectional context simultaneously.
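To make the two ingredients named above concrete, the following is a minimal sketch and not the authors' implementation. It assumes linear interpolation over time as the prefilling step and assumes that nRMSD is normalized by the observed range of the variable; the abstract specifies neither choice. The function names `prefill_linear` and `nrmsd` are hypothetical.

```python
import math


def prefill_linear(series):
    """Fill None gaps in a time-ordered series by linear interpolation.

    Leading and trailing gaps take the nearest observed value.
    Assumption: this is an illustrative prefilling choice only.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        # Nearest observed neighbours on each side of the gap, if any.
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def nrmsd(truth, imputed, masked_idx):
    """RMSD over masked positions, divided by the range of the true values.

    Assumption: range normalization; other normalizations exist.
    """
    span = max(truth) - min(truth)
    if span == 0:
        raise ValueError("cannot normalize a constant series")
    squared = [((imputed[i] - truth[i]) / span) ** 2 for i in masked_idx]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: hide two known values, prefill them, and score the result.
truth = [1.0, 2.0, 4.0, 3.0, 5.0]
masked_idx = [1, 3]
observed = [None if i in masked_idx else v for i, v in enumerate(truth)]
filled = prefill_linear(observed)
score = nrmsd(truth, filled, masked_idx)
print(filled, round(score, 4))
```

In a pipeline of the kind the abstract describes, a prefilled matrix like `filled` would serve as the feature input to a supervised regressor such as XGBoost, with each laboratory variable predicted from its own neighbouring time points and from the other variables.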




Data Availability
The dataset is sampled from the Medical Information Mart for Intensive Care III (MIMIC-III) (https://mimic.physionet.org/).
Notes
Data challenge details at http://www.ieee-ichi.org/challenge.html
Acknowledgments
We thank the 7th IEEE International Conference on Healthcare Informatics (ICHI) Data Analytics Challenge on Missing data Imputation (DACMI) challenge committee for the dataset used in this study.
Funding
This research was supported, in part, by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012854. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Code Availability
Source code at https://github.com/yanchao0222/SMILES.
Cite this article
Zhang, X., Yan, C., Gao, C. et al. Predicting Missing Values in Medical Data Via XGBoost Regression. J Healthc Inform Res 4, 383–394 (2020). https://doi.org/10.1007/s41666-020-00077-1
DOI: https://doi.org/10.1007/s41666-020-00077-1