
Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm

  • Conference paper
Computational Science – ICCS 2021 (ICCS 2021)

Abstract

Missing values are ubiquitous in real-world datasets. When a dataset is not very large, addressing its missing values with an appropriate imputation method benefits analysis significantly. In this paper, we applied and evaluated a new imputation approach called k-Nearest Neighbour with Most Significant Features and incomplete cases (KNNI\(_\mathrm{MSF}\)) to impute missing values in a healthcare dataset. The algorithm combines k-Nearest Neighbour (kNN) with ReliefF feature selection to address incomplete cases in the dataset. The merit of imputation is measured by comparing the classification performance of models trained on the dataset with and without imputation. We used a real-world dataset, “very low birth weight infants”, to predict the survival outcome of infants with low birth weights. Five different classifiers were used in the experiments. A comparison of multiple performance metrics shows that classifiers built on the imputed dataset produce much better outcomes. In general, KNNI\(_\mathrm{MSF}\) outperformed the k-Nearest Neighbour Imputation using Random Forest feature weights (KNNI\(_\mathrm{RF}\)) algorithm with respect to balanced accuracy and specificity.
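
The general idea of kNN imputation restricted to a subset of features can be illustrated with a minimal sketch. This is not the authors' KNNI\(_\mathrm{MSF}\) implementation: the function name `knn_impute_selected`, the parameter `feature_idx`, and the toy data are hypothetical, the data are assumed to be numeric, and the list of "most significant" features is assumed to have been chosen beforehand (for example by a ReliefF ranking, which is not reproduced here). Rows that are themselves incomplete may still act as neighbours, as long as they have a value in the column being filled.

```python
import numpy as np


def knn_impute_selected(X, feature_idx, k=3):
    """Fill NaNs using the k nearest rows.

    Distances use only the columns in feature_idx (hypothetical
    "most significant" features, assumed to be selected elsewhere).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance from row i to every other row, computed over the
        # selected features that are observed in both rows.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = [f for f in feature_idx
                      if not np.isnan(X[i, f]) and not np.isnan(X[j, f])]
            if shared:
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.where(missing)[0]:
            # Incomplete rows can be donors if they have a value in col.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:
                filled[i, col] = X[donors, col].mean()
    return filled


# Toy data (made up for illustration): the last row lacks column 2.
X = np.array([
    [1.00, 2.00, 10.0],
    [1.10, 2.10, 12.0],
    [5.00, 6.00, 50.0],
    [1.05, 2.05, np.nan],
])
filled = knn_impute_selected(X, feature_idx=[0, 1], k=2)
print(filled[3, 2])
```

In this toy example the two closest rows on columns 0 and 1 are the first two, so the missing entry is filled with the mean of their column-2 values. A weighted distance or a mixed-type similarity measure could replace the plain root-mean-square distance used here.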

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Thomas, T., Rajabi, E. (2021). Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12746. Springer, Cham. https://doi.org/10.1007/978-3-030-77977-1_17

  • DOI: https://doi.org/10.1007/978-3-030-77977-1_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77976-4

  • Online ISBN: 978-3-030-77977-1

  • eBook Packages: Computer Science (R0)
