Missing data (MD) is a common and unavoidable problem for data mining (DM)-based decision systems in e-health, since many historical medical datasets contain a large number of missing values. A pre-processing stage is therefore usually required to handle missing values before building any DM-based decision system. The purpose of this paper is to evaluate the impact of MD techniques on classification systems for the diagnosis of cardiovascular dysautonomias. We analyzed and compared the accuracy rates of four classification techniques, namely random forest (RF), support vector machines (SVM), C4.5 decision tree, and Naive Bayes (NB), under two MD techniques: deletion and imputation with k-nearest neighbors (KNN). A total of 216 experiments were carried out, combining the four classifiers with three missingness mechanisms (MCAR: missing completely at random, MAR: missing at random, and NMAR: not missing at random), two MD techniques (deletion and KNN imputation), and nine MD percentages from 10% to 90%, on a dataset collected from the autonomic nervous system (ANS) unit of the University Hospital Avicenne in Morocco. The results suggest that using KNN imputation rather than deletion improves the accuracy rates of the four classifiers. Moreover, higher MD percentages degrade the performance of the classification techniques regardless of the MD mechanism and MD technique used: the accuracy rates of the four classifiers decrease as the MD percentage increases.
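The count of 216 experiments is consistent with a full factorial design in which every classifier is evaluated under every combination of missingness mechanism, MD technique, and MD percentage (4 × 3 × 2 × 9 = 216). The following minimal sketch enumerates such a grid. It assumes the factors are fully crossed and that the percentages advance in steps of 10; the variable names are illustrative and are not taken from the study.

```python
from itertools import product

# Factor levels as listed in the abstract; names are illustrative only.
CLASSIFIERS = ["RF", "SVM", "C4.5", "NB"]
MECHANISMS = ["MCAR", "MAR", "NMAR"]
MD_TECHNIQUES = ["deletion", "KNN imputation"]
# Assumption: nine percentages from 10% to 90% in steps of 10.
MD_PERCENTAGES = list(range(10, 100, 10))

# Assumption: a full factorial design, i.e. every combination is run once.
experiments = list(
    product(MECHANISMS, MD_TECHNIQUES, MD_PERCENTAGES, CLASSIFIERS)
)

print(len(experiments))  # 216
```

Each tuple in `experiments` identifies one run, for example `("MCAR", "deletion", 10, "RF")`.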

Graphical abstract

This work was conducted within the research project MPHR-PPR1/09-2015-2018. The authors would like to thank the Moroccan MESRSFC and CNRST for their support.
This work is also part of the GINSENG-UMU (TIN2015-70259-C2-2-R) project, supported by the Spanish Ministry of Economy, Industry and Competitiveness and European FEDER funds.
