Abstract
Missing values are ubiquitous in real-world datasets. When a dataset is not very large, addressing its missing values with an appropriate imputation method benefits analysis significantly. In this paper, we applied and evaluated a new imputation approach called k-Nearest Neighbour with Most Significant Features and incomplete cases (KNNI\(_\mathrm{MSF}\)) to impute missing values in a healthcare dataset. This algorithm combines k-Nearest Neighbour (kNN) with ReliefF feature selection to address incomplete cases in the dataset. The merit of imputation is measured by comparing the classification performance of models trained on the dataset with and without imputation. We used a real-world dataset, “very low birth weight infants”, to predict the survival outcome of infants with low birth weights. Five different classifiers were used in the experiments. A comparison of multiple performance metrics shows that classifiers built on the imputed dataset produce much better outcomes. KNNI\(_\mathrm{MSF}\) generally outperformed the k-Nearest Neighbour Imputation using Random Forest feature weights (KNNI\(_\mathrm{RF}\)) algorithm with respect to balanced accuracy and specificity.
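The general idea of kNN imputation restricted to the most significant features can be illustrated with a minimal sketch. This is not the KNNI\(_\mathrm{MSF}\) implementation evaluated in the paper: the function name `knn_impute_selected`, its parameters, and the root-mean-square distance are assumptions made for illustration, and `feature_weights` is a hypothetical stand-in for scores that a feature selector such as ReliefF would supply. The sketch also assumes numeric features on comparable scales.

```python
import numpy as np


def knn_impute_selected(X, target_col, feature_weights, n_features=2, k=3):
    """Fill NaNs in one column from the k nearest donor rows.

    Distances use only the highest-weighted features (hypothetical weights,
    standing in for ReliefF scores). Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()

    # Rank predictor columns by weight, excluding the column being imputed.
    ranked = [j for j in np.argsort(feature_weights)[::-1] if j != target_col]
    selected = ranked[:n_features]

    # Donors are rows where the target value is observed; they may still
    # have missing values elsewhere (incomplete cases).
    donors = np.flatnonzero(~np.isnan(X[:, target_col]))

    for i in np.flatnonzero(np.isnan(X[:, target_col])):
        dists = []
        for d in donors:
            # Compare only on selected features observed in both rows.
            shared = [j for j in selected
                      if not np.isnan(X[i, j]) and not np.isnan(X[d, j])]
            if shared:
                diff = X[i, shared] - X[d, shared]
                dists.append((np.sqrt(np.mean(diff ** 2)), d))
        if not dists:
            continue  # no comparable donor; leave the value missing
        dists.sort(key=lambda pair: pair[0])
        nearest = [d for _, d in dists[:k]]
        out[i, target_col] = X[nearest, target_col].mean()
    return out


# Toy data: column 2 is imputed; row 3 is an incomplete donor.
data = np.array([
    [1.0, 10.0, 5.0],
    [1.1, 11.0, np.nan],
    [5.0, 50.0, 9.0],
    [0.9, np.nan, 4.0],
    [5.2, 49.0, 10.0],
])
imputed = knn_impute_selected(data, target_col=2,
                              feature_weights=[0.9, 0.8, 0.0],
                              n_features=2, k=2)
```

In this toy example, row 3 lacks a value in column 1 but can still act as a donor, because the distance is computed on whichever selected features both rows have observed. Only the target column is filled; missing values in other columns are left as they are.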
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Thomas, T., Rajabi, E. (2021). Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12746. Springer, Cham. https://doi.org/10.1007/978-3-030-77977-1_17
DOI: https://doi.org/10.1007/978-3-030-77977-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77976-4
Online ISBN: 978-3-030-77977-1
eBook Packages: Computer Science, Computer Science (R0)