Skip to main content
Log in

Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data

  • Original Research
  • Published:
Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

Abstract

Historical and real-time healthcare data sets are valuable sources of information for predictive data analytics. However, most of the historical healthcare data sets are overloaded with challenges. One of the most frequently faced challenge is the problem of missing values, occurring because of the inaccuracies in data transmission or data entry processes. An appropriate technique for handling missing values is required to generate good quality data sets for achieving better prediction results. Removing the records with missing values, known as marginalization, poses an easy way out to this challenge. But, this will lessen the data volume of the historical data set and disturb the class balance of the data set. An alternative to marginalization is replacing missing values with plausible values, known as imputation. This paper proposes a missing value imputation technique, CLUSTIMP, using an unsupervised neural network Adaptive Resonance Theory 2 (ART2). The efficiency of the proposed imputation method is evaluated on the incomplete Mammographic mass data set and Hepatocellular Carcinoma data set (HCC) from the UCI repository considering Root Mean Squared Error (RMSE) rate and classification accuracy as the evaluation metrics. The proposed CLUSTIMP imputation algorithm outperforms existing state-of-the-art imputation methods by reducing classifiers error rates between 2 and 11%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

References

  • Almeida RJ, Kaymak U, Sousa JM (2010) A new approach to dealing with missing values in data-driven fuzzy modeling. In: International conference on fuzzy systems, pp. 1–7. IEEE

  • Armentano R, Bhadoria RS, Chatterjee P, Deka GC (2017) The internet of things: foundation for smart cities, EHealth, and ubiquitous computing. CRC Press, Boca Raton

    Book  Google Scholar 

  • Arslanturk S, Siadat M-R, Ogunyemi T, Killinger K, Diokno A (2016) Analysis of incomplete and inconsistent clinical survey data. Knowl Inform Syst 46(3):731–750

    Article  Google Scholar 

  • Beaulieu-Jones BK, Moore JH (2017) Missing data imputation in the electronic health record using deeply learned autoencoders. In: Pacific Symposium on Biocomputing 2017, pp. 207–218. World Scientific

  • Bhadoria RS, Bajpai D (2019) Stabilizing sensor data collection for control of environment-friendly clean technologies using internet of things. Wirel Personal Commun 108(1):493–510

    Article  Google Scholar 

  • Carpenter GA, Grossberg S (2017) Adaptive resonance theory. Springer, Berlin

    Book  Google Scholar 

  • Chan LS, Dunn OJ (1972) The treatment of missing values in discriminant analysisi. the sampling experiment. J Am Stat Assoc 67(338):473–477

    MATH  Google Scholar 

  • Chen M, Hao Y, Hwang K, Wang L, Wang L (2017) Disease prediction by machine learning over big data from healthcare communities. Ieee Access 5:8869–8879

    Article  Google Scholar 

  • Davis D, Rahman M (2016) Missing value imputation using stratified supervised learning for cardiovascular data. J. Inf. Data Min 1(2):1–13

    Google Scholar 

  • Elter M, Schulz-Wendtland R, Wittenberg T (2007) The prediction of breast cancer biopsy outcomes using two cad approaches that both emphasize an intelligible decision process. Med Phys 34(11):4164–4172

    Article  Google Scholar 

  • Ford BL (1983) An overview of hot-deck procedures. Incomplete Data Sample Surv 2(Part IV):185–207

    Google Scholar 

  • Haji-Maghsoudi S, Rastegari A, Garrusi B, Baneshi MR (2018) Addressing the problem of missing data in decision tree modeling. J Appl Stat 45(3):547–557

    Article  MathSciNet  Google Scholar 

  • Imani F, Cheng C, Chen R, Yang H (2019) Nested gaussian process modeling and imputation of high-dimensional incomplete data under uncertainty. IISE Trans Healthc Syst Eng 9(4):315–326

    Article  Google Scholar 

  • Jerez JM, Molina I, García-Laencina PJ, Alba E, Ribelles N, Martín M, Franco L (2010) Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial Intell Med 50(2):105–115

    Article  Google Scholar 

  • Junninen H, Niska H, Tuppurainen K, Ruuskanen J, Kolehmainen M (2004) Methods for imputation of missing values in air quality data sets. Atmospheric Environ 38(18):2895–2907

    Article  Google Scholar 

  • Kayal CK, Bagchi S, Dhar D, Maitra T, Chatterjee S (2019) Hepatocellular carcinoma survival prediction using deep neural network. In: Proceedings of international ethical hacking conference 2018, pp. 349–358. Springer

  • Kurt I, Ture M, Kurum AT (2008) Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Syst Appl 34(1):366–374

    Article  Google Scholar 

  • LaFreniere D, Zulkernine F, Barber D, Martin K (2016) Using machine learning to predict hypertension from a clinical dataset. In: 2016 IEEE symposium series on computational intelligence (SSCI), pp. 1–7. IEEE

  • Mazumder RS, Bhadoria RS, Deka GC (eds) (2017) Distributed computing in big data analytics. Concepts, technologies and applications. Springer, Cham

  • Momeni A, Pincus M, Libien J (2018) Imputation and missing data. In: Introduction to statistical methods in pathology. Springer, Cham, pp 185–200

    Chapter  Google Scholar 

  • Nguyen DV, Wang N, Carroll RJ (2004) Evaluation of missing value estimation for microarray data. J Data Sci 2(4):347–370

    Google Scholar 

  • Penny KI, Chesney T (2006) Imputation methods to deal with missing values when data mining trauma injury data. In: 28th international conference on information technology interfaces, 2006, pp. 213–218. IEEE

  • Rahman MM (2014) Machine learning based data pre-processing for the purpose of medical data mining and decision support. PhD thesis, University of Hull

  • Rubin DB (2004) Multiple imputation for nonresponse in surveys, vol 81. Wiley, Hoboken

    MATH  Google Scholar 

  • Santos MS, Abreu PH, García-Laencina PJ, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59

    Article  Google Scholar 

  • Sen S, Das M, Chatterjee R (2018) Estimation of incomplete data in mixed dataset. In: Progress in intelligent computing techniques: theory, practice, and applications. Springer, Singapore, pp 483–492

    Chapter  Google Scholar 

  • Shobha K, Nickolas S (2019) Imputation of multivariate attribute values in big data. In: Smart intelligent computing and applications. Springer, Singapore, pp 53–60

    Chapter  Google Scholar 

  • Sokat KY, Dolinskaya IS, Smilowitz K, Bank R (2018) Incomplete information imputation in limited data environments with application to disaster response. Europ J Oper Res 269(2):466–485

    Article  Google Scholar 

  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17(6):520–525

    Article  Google Scholar 

  • Turabieh H, Salem AA, Abu-El-Rub N (2018) Dynamic l-rnn recovery of missing data in iomt applications. Future Generation Comput Syst 89:575–583

    Article  Google Scholar 

  • Tutz G, Ramzan S (2015) Improved methods for the imputation of missing data by nearest neighbor methods. Comput Stat Data Anal 90:84–99

    Article  MathSciNet  Google Scholar 

  • Van der Heijden GJ, Donders ART, Stijnen T, Moons KG (2006) Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol 59(10):1102–1109

    Article  Google Scholar 

  • Verma H, Kumar S (2019) An accurate missing data prediction method using lstm based deep learning for health care. In: Proceedings of the 20th international conference on distributed computing and networking, pp. 371–376. ACM

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to K. Shobha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shobha, K., Savarimuthu, N. Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data. J Ambient Intell Human Comput 12, 1771–1781 (2021). https://doi.org/10.1007/s12652-020-02250-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s12652-020-02250-1

Keywords

Navigation