Abstract
Missing data (MD) is a common and unavoidable problem for data mining (DM)-based decision systems in e-health, since many historical medical datasets contain a large number of missing values. A pre-processing stage is therefore usually required to handle missing values before building any DM-based decision system. The purpose of this paper is to evaluate the impact of MD techniques on classification systems for cardiovascular dysautonomias diagnosis. We analyzed and compared the accuracy rates of four classification techniques, random forest (RF), support vector machines (SVM), C4.5 decision tree, and Naive Bayes (NB), under two MD techniques: deletion and k-nearest neighbors (KNN) imputation. A total of 216 experiments were carried out, combining the four classifiers with three missingness mechanisms (MCAR: missing completely at random, MAR: missing at random, and NMAR: not missing at random), the two MD techniques, and nine MD percentages from 10% to 90%, on a dataset collected from the autonomic nervous system (ANS) unit of the University Hospital Avicenne in Morocco. The results suggest that using KNN imputation rather than deletion improves the accuracy rates of the four classifiers. Moreover, the MD percentage has a negative impact on classification performance regardless of the MD mechanism and MD technique used: the accuracy rates of the four classifiers decrease as the MD percentage increases.
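The experimental design summarized above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (`inject_mcar`, `delete_incomplete`, `knn_impute`), the choice of `k=3`, the reading of "MD percentage" as the share of rows made incomplete, and the use of a Euclidean distance with mean-of-neighbors filling on numeric features are all assumptions introduced for illustration. Only the MCAR mechanism is simulated here.

```python
import itertools
import math
import random

MECHANISMS = ["MCAR", "MAR", "NMAR"]
MD_TECHNIQUES = ["deletion", "knn_imputation"]
MD_PERCENTAGES = list(range(10, 100, 10))  # 10, 20, ..., 90
CLASSIFIERS = ["RF", "SVM", "C4.5", "NB"]

# Full factorial design: 3 mechanisms x 2 techniques x 9 percentages x 4 classifiers.
EXPERIMENTS = list(
    itertools.product(MECHANISMS, MD_TECHNIQUES, MD_PERCENTAGES, CLASSIFIERS)
)


def inject_mcar(rows, percentage, rng):
    """Blank one random feature in `percentage` % of the rows (assumed reading of MCAR)."""
    rows = [list(r) for r in rows]
    n_incomplete = round(len(rows) * percentage / 100)
    for i in rng.sample(range(len(rows)), n_incomplete):
        rows[i][rng.randrange(len(rows[i]))] = None
    return rows


def delete_incomplete(rows):
    """Listwise deletion: keep only the rows without any missing value."""
    return [list(r) for r in rows if None not in r]


def knn_impute(rows, k=3):
    """Fill each missing value with the feature mean over the k nearest complete rows."""
    donors = delete_incomplete(rows)
    if not donors:
        raise ValueError("KNN imputation needs at least one complete row")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        # Distance is computed only on the features observed in the incomplete row.
        nearest = sorted(
            donors,
            key=lambda d: math.dist(
                [row[j] for j in observed], [d[j] for j in observed]
            ),
        )[:k]
        imputed.append(
            [
                v if v is not None else sum(d[j] for d in nearest) / len(nearest)
                for j, v in enumerate(row)
            ]
        )
    return imputed
```

Deletion shrinks the training set as the MD percentage grows, whereas imputation keeps every case at the cost of estimated values; this is the trade-off the experiments compare. The MAR and NMAR mechanisms would require the missingness to depend on observed or unobserved values respectively, which this sketch does not attempt.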

Acknowledgment
This work was conducted within the research project MPHR-PPR1/09-2015-2018. The authors would like to thank the Moroccan MESRSFC and CNRST for their support.
Funding
This work is also part of the GINSENG-UMU project (TIN2015-70259-C2-2-R), supported by the Spanish Ministry of Economy, Industry and Competitiveness and European FEDER funds.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Idri, A., Kadi, I., Abnane, I. et al. Missing data techniques in classification for cardiovascular dysautonomias diagnosis. Med Biol Eng Comput 58, 2863–2878 (2020). https://doi.org/10.1007/s11517-020-02266-x
DOI: https://doi.org/10.1007/s11517-020-02266-x