Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm

Thomas, Tressy; Rajabi, Enayat

doi:10.1007/978-3-030-77977-1_17

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12746)

Included in the following conference series:

International Conference on Computational Science

Abstract

Missing values are ubiquitous in real-world datasets. When a dataset is not very large, addressing its missing values with an appropriate imputation method benefits analysis significantly. In this paper, we applied and evaluated a new imputation approach called k-Nearest Neighbour with Most Significant Features and incomplete cases (KNNI\(_\mathrm{MSF}\)) to impute missing values in a healthcare dataset. The algorithm combines k-Nearest Neighbour (kNN) with ReliefF feature selection to address incomplete cases in the dataset. The merit of imputation is measured by comparing the classification performance of models trained on the dataset with and without imputation. We used a real-world dataset, “very low birth weight infants”, to predict the survival outcome of infants with low birth weights. Five different classifiers were used in the experiments. A comparison of multiple performance metrics shows that classifiers built on the imputed dataset produce much better outcomes. In general, KNNI\(_\mathrm{MSF}\) outperformed the k-Nearest Neighbour Imputation using Random Forest feature weights (KNNI\(_\mathrm{RF}\)) algorithm with respect to balanced accuracy and specificity.
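
The general idea of feature-weighted kNN imputation that also draws on incomplete cases can be illustrated with a minimal sketch. This is not the authors' KNNI\(_\mathrm{MSF}\) implementation: the function name weighted_knn_impute, the hand-supplied weights (standing in for ReliefF feature scores), the restriction to numeric features, and the mean of the k neighbours as the fill value are all assumptions made for illustration.

```python
import numpy as np

def weighted_knn_impute(X, weights, k=3):
    """Fill NaNs using the k nearest rows under a feature-weighted distance.

    Hypothetical sketch: `weights` stands in for feature-importance scores
    (e.g. from ReliefF); numeric, comparably scaled features are assumed.
    Rows that themselves contain NaNs may still act as donors, as long as
    they have an observed value for the feature being imputed.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    observed = ~np.isnan(X)
    out = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        for j in np.where(~observed[i])[0]:
            candidates = []
            for r in range(n_rows):
                if r == i or not observed[r, j]:
                    continue
                # Compare only on features observed in both rows.
                shared = observed[i] & observed[r]
                w_sum = w[shared].sum()
                if w_sum == 0:
                    continue  # nothing informative to compare on
                diff = X[i, shared] - X[r, shared]
                dist = np.sqrt(np.sum(w[shared] * diff ** 2) / w_sum)
                candidates.append((dist, r))
            if not candidates:
                continue  # leave the value missing if no donor exists
            candidates.sort()
            nearest = [r for _, r in candidates[:k]]
            # Donor values come from the original data, not earlier fills.
            out[i, j] = X[nearest, j].mean()
    return out

# Toy data (invented for this example): two rows have one missing value each.
X_demo = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [5.0, 6.0, 50.0],
    [1.0, 2.0, np.nan],
    [5.1, np.nan, 52.0],
])
filled = weighted_knn_impute(X_demo, weights=[1.0, 1.0, 1.0], k=2)
print(filled)
```

Normalising the distance by the sum of the weights over the shared features keeps distances comparable between donor pairs that overlap on different numbers of features, which matters once incomplete rows are allowed to serve as neighbours.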

Author information

Authors and Affiliations

Shannon School of Business, Cape Breton University, Sydney, NS, Canada
Tressy Thomas & Enayat Rajabi

Editor information

Editors and Affiliations

AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
Ludwig-Maximilians-Universität München, Munich, Germany
Dieter Kranzlmüller
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M.A. Sloot

Cite this paper

Thomas, T., Rajabi, E. (2021). Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12746. Springer, Cham. https://doi.org/10.1007/978-3-030-77977-1_17

DOI: https://doi.org/10.1007/978-3-030-77977-1_17
Published: 09 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77976-4
Online ISBN: 978-3-030-77977-1
eBook Packages: Computer Science, Computer Science (R0)
