Decision tree and artificial immune systems for stroke prediction in imbalanced data
Introduction
Despite scientific advances in the care of stroke patients in recent years, stroke remains a worldwide public health problem and is among the leading causes of adult death and disability (Benjamin et al., 2018, Thrift et al., 2014). More than 43 million cases were reported globally in 2015 (Benjamin et al., 2018), and this number tends to increase with the growth of the elderly population (Simpkins et al., 2020). In addition, the prevalence of stroke has also increased in the younger population (GBD, 2018). Usually, stroke patients undergo an initial period of treatment in the hospital. In the next stage, they spend an extended period at home recovering their physical, speech, and cognitive functions (Chen et al., 2019), owing to sequelae of stroke such as depression and impaired balance or loss of physical functioning (Alghwiri, 2016).
Early treatment is a way to minimize the sequelae of stroke, since more than 90% of metabolic risk factors are controllable (O’Donnell et al., 2016). Clinical examination indicates a stroke diagnosis, which can be confirmed by a computed tomography scan; the gold standard for distinguishing the disease’s subtypes is the non-contrast computed tomography scan (Wardlaw et al., 2004). However, these imaging exams can be expensive and unavailable in hard-to-reach regions such as rural areas (Leira, Hess, Torner, & Adams, 2008); in such cases, weighted clinical score systems can be used to speed up the diagnosis of stroke subtypes (Jin et al., 2016). Other alternatives for the diagnosis of stroke include increasing state investments or using Machine Learning (ML) techniques to provide an early and low-cost diagnosis (García-Temza, Risco-Martín, Ayala, Roselló, & Camarasaltas, 2019). ML techniques are interesting because they emulate the human way of thinking and making decisions (El Naqa & Murphy, 2015), analyze large data sets containing many characteristics in a reasonable time, and can handle complex relationships within the data, making them more accurate than human specialists in some specific situations (Deo, 2015).
The use of ML techniques for health-related diagnostic tasks faces some challenges; one of them is that, compared with healthy subjects, patients with a given disease are generally a small part of the total population. This disproportion in the representation of healthy and non-healthy subjects is known as the problem of imbalanced data sets, where the class with the highest prevalence is called the majority class, while the rarest class is called the minority class (Haixiang et al., 2017). The challenge in applying ML techniques to imbalanced data sets is that they tend to assign all instances to the majority class and none to the minority class, which is generally the event of greatest interest (Li et al., 2017).
Several papers in the literature have used ML techniques for predicting stroke. However, most of them ignore class imbalance, whereas in clinical practice stroke data sets are naturally imbalanced (Liu, Fan, & Wu, 2019). In Colak, Karaman, and Turtay (2015), for example, the authors used Artificial Neural Networks (ANN) or Support Vector Machines (SVM) and a knowledge discovery process to predict stroke. A data set with 167 healthy patients and 130 stroke patients, described by eight clinical variables, was used for training and evaluating the models. In Khosla et al. (2010), SVMs and Margin-based Censored Regression (MCR) are used as learning algorithms, together with a proposed automatic feature selection procedure, to predict stroke. A comparison of several ML methods applied to the prediction of ischemic stroke is made in Arslan, Colak, and Sarihan (2016). The experiments were performed using a data set with 112 healthy patients and 80 sick patients, with SVM achieving the best accuracy values.
In Liu et al. (2019), a hybrid approach is described for stroke prediction based on physiological data from a highly imbalanced data set (1.18% of cases of stroke). The hybrid approach is executed in three distinct steps: (i) a data imputation process based on Random Forests (Breiman, 2001) is executed; (ii) the data set is balanced using a methodology that combines Principal Component Analysis (PCA) and k-Means clustering; (iii) the classification is performed by a deep neural network with automatically adjusted hyperparameters.
The approach detailed in Liu et al. (2019) presented satisfactory sensitivity but poor specificity. Thus, strategies that improve mainly the specificity without reducing the sensitivity should be investigated. Also, the neural network used for prediction is not interpretable, i.e., its results are not expressed in terms that humans can understand. In health-related applications, it is interesting to adopt interpretable ML techniques, as they facilitate the investigation of the problem, generate new insights for solving it, and improve specialists’ understanding (Caruana et al., 2015).
The adoption of ML tools in clinical practice requires careful confirmation of their performance before use. When the results of a diagnostic test are binary, the discrimination performance is usually measured through sensitivity and specificity (Park & Han, 2018). Sensitivity is defined as the proportion of sick individuals who are correctly identified as having the disease. Specificity, on the other hand, is the proportion of non-sick individuals who are correctly identified as not having the disease (Park, Choi, & Byeon, 2021).
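As an illustration of these two definitions, the following minimal Python sketch computes both measures from true and predicted binary labels. The function name `sensitivity_specificity`, the label encoding (1 = sick, 0 = non-sick), and the toy labels are assumptions made for this example; they are not taken from the paper.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); labels are assumed to be 1 = sick, 0 = non-sick."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, predicted sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, predicted non-sick
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-sick, predicted non-sick
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-sick, predicted sick
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels: 2 sick and 8 non-sick subjects.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A classifier that assigns every subject to the majority (non-sick) class.
always_non_sick = [0] * len(y_true)
print(sensitivity_specificity(y_true, always_non_sick))  # (0.0, 1.0)
```

The toy case shows why both measures are reported together on imbalanced data: a classifier that never predicts the minority class reaches perfect specificity while detecting no sick individual at all.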
Therefore, in this work, we propose an alternative approach for stroke prediction on highly imbalanced data sets. The approach, illustrated by Fig. 1, combines both Immune/Neural (D’Angelo et al., 2016) and One-Sided Selection (OSS) (Kubat, Matwin, et al., 1997) techniques to balance the training data and uses Decision Trees (DT) induced by Genetic Programming (GP) (Koza, 1992) for the classification operation. Fig. 1 identifies the imbalanced training data, which is summarized into a balanced set by the proposed balancing procedure. The GP algorithm uses this balanced set to evolve a population of DTs. The best decision tree (Decision Tree*) returned by the GP algorithm is used to classify unknown instances.
In this work, we use GP in the induction process instead of traditional strategies such as CART (Breiman, Friedman, Stone, & Olshen, 1984) and C4.5 (Quinlan, 2014) because of its capacity for global optimization. The traditional strategies use a greedy search in the tree generation process, which can lead to sub-optimal solutions. Furthermore, the recursive partitioning of the data set can leave subsets that are too small for attribute selection in the deeper nodes of a tree, overfitting the data (Barros, Basgalupp, De Carvalho, & Freitas, 2011).
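To make the notion of greedy search concrete, the sketch below selects a single split at one node by minimizing the weighted Gini impurity, in the spirit of CART-style induction. It is an illustrative assumption, not the procedure used in this work: the helper names `gini` and `best_greedy_split`, the midpoint thresholds, and the toy data are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_greedy_split(
    rows: List[Sequence[float]], labels: List[int]
) -> Optional[Tuple[int, float, float]]:
    """Return (feature index, threshold, weighted impurity) of the best single split.

    Only the current node is examined and the choice is never revisited,
    which is what makes the search greedy.
    """
    if not rows:
        return None
    best: Optional[Tuple[int, float, float]] = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= thr]
            right = [y for row, y in zip(rows, labels) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best


# Hypothetical toy node: feature 0 separates the classes, feature 1 does not.
rows = [(30.0, 1.0), (45.0, 0.0), (60.0, 1.0), (75.0, 0.0)]
labels = [0, 0, 1, 1]
print(best_greedy_split(rows, labels))  # (0, 52.5, 0.0)
```

Because each node is decided in isolation, a split that looks poor locally but would enable a better subtree is never chosen. A population-based search such as GP evaluates whole trees instead, which is the motivation stated above.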
In summary, this paper focuses on two main challenges. First, the data sets used in most previous studies applying ML to stroke prediction do not suffer from class imbalance; when imbalance is present, the performance of these methods in terms of sensitivity and specificity is heavily compromised. In response, we propose a new method for balancing the data set through One-Sided Selection and Artificial Immune Systems. This balancing mechanism is combined with Decision Trees to improve stroke prediction on a highly imbalanced data set, in terms of specificity and sensitivity, when compared to the state of the art. Second, the algorithms generally applied to the stroke prediction problem do not yield interpretable models; such models are important in health problems because they allow new hypotheses related to the problem to emerge and to be validated against specialists’ knowledge. Thus, we also present a new simplification operator that reduces the complexity of the trees induced by GP, increasing the interpretability of the resulting models. The remainder of this paper is organized as follows. Section 2 describes the proposed approach. Section 3 presents the data set used, the experiments, and the results. Finally, the conclusions are presented in Section 4.
Immune/neural approach
Artificial Immune Systems (AIS) are adaptive systems whose development is inspired by theoretical immunology and the known immune functions (Timmis, Hone, Stibor, & Clark, 2008). AIS constitute an area of bio-inspired computation in which abstract components of the immune system are proposed to solve engineering problems (Castro, 2002). Among the immune functions implemented by these components, the basic principles of clonal selection can be used for pattern recognition and
Data set
In order to evaluate the proposed approach, the present work uses the same data set evaluated in Liu et al. (2019). The full data set is provided in Liu (2019). The data set is composed of 43,400 instances with ten features, as described in Table 1. In this work, all cases with missing values for at least one feature were removed. The remaining data set is a typical imbalanced data set containing 29,063 instances, with 1.89% of stroke occurrences.
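The preprocessing described above (removing every instance with a missing value, then measuring the share of the minority class) can be sketched as follows. This is a minimal illustration under stated assumptions: the field names, the use of `None` for missing values, and the toy records are hypothetical and are not taken from Table 1 or from the original data files.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def drop_incomplete(records: List[Record]) -> List[Record]:
    """Keep only the records in which no feature value is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


def minority_rate(records: List[Record], target: str = "stroke") -> float:
    """Fraction of records whose target value equals 1."""
    if not records:
        raise ValueError("no records")
    return sum(1 for r in records if r[target] == 1) / len(records)


# Hypothetical toy records; the second one has a missing value and is removed.
records = [
    {"age": 67.0, "bmi": 36.6, "stroke": 1},
    {"age": 61.0, "bmi": None, "stroke": 1},
    {"age": 49.0, "bmi": 34.4, "stroke": 0},
    {"age": 32.0, "bmi": 22.1, "stroke": 0},
    {"age": 58.0, "bmi": 27.3, "stroke": 0},
]
complete = drop_incomplete(records)
print(len(complete), minority_rate(complete))  # 4 0.25
```

Applied to the figures reported above, a rate of 1.89% over 29,063 instances corresponds to roughly 550 stroke cases, which gives a sense of how few minority examples are available for training.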
Experimental setup
Two experiments are proposed to evaluate the
Conclusion
In this paper, we have presented a novel approach for stroke prediction based on decision trees generated through GP aided by an immune/neural AIS. The proposed approach was evaluated on a highly imbalanced data set composed of physiological data from sick and non-sick patients. The main objective was to present a technique capable of dealing with the imbalance present in the data set while providing a solution that can be interpreted by human specialists. The results have illustrated the achieved
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was supported by grants from the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil (Grant Numbers: 307933/2018-0 and 309909/2019-8), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil and the Fundação de Amparo a Pesquisa do Estado de Minas Gerais (FAPEMIG), Brazil (Grant Number: PPM-00053-17).
All authors contributed equally to this work.
References (57)
The correlation between depression, balance, and physical functioning post stroke. Journal of Stroke and Cerebrovascular Diseases (2016).
Different medical data mining approaches based prediction of ischemic stroke. Computer Methods and Programs in Biomedicine (2016).
Home-based technologies for stroke rehabilitation: A systematic review. International Journal of Medical Informatics (2019).
Application of knowledge discovery process on the prediction of stroke. Computer Methods and Programs in Biomedicine (2015).
A new fault classification approach applied to Tennessee Eastman benchmark process. Applied Soft Computing (2016).
Unbalanced breast cancer data classification using novel fitness functions in genetic programming. Expert Systems with Applications (2020).
Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications (2017).
A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artificial Intelligence in Medicine (2019).
Immune/neural approach to characterize salivary gland neoplasms (SGN). Applied Soft Computing (2020).
Global and regional effects of potentially modifiable risk factors associated with acute stroke in 32 countries (INTERSTROKE): A case-control study. The Lancet (2016).
Theoretical advances in artificial immune systems. Theoretical Computer Science.
A multi-objective genetic programming approach to developing Pareto optimal decision trees. Decision Support Systems.
Neuro-evolutionary models for imbalanced classification problems. Journal of King Saud University - Computer and Information Sciences.
Diagnostic accuracy of clinical tools for assessment of acute stroke: A systematic review. BMC Emergency Medicine.
Genetic programming: an introduction on the automatic evolution of computer programs and its applications.
A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).
Heart disease and stroke statistics—2018 update: A report from the American Heart Association. Circulation.
Random forests. Machine Learning.
Classification and regression trees.
Intelligible models for healthcare: Predicting pneumonia risk and hospital 30 day readmission.
Artificial immune systems: A new computational intelligence approach.
Data mining for imbalanced datasets: An overview.
Learning and optimization using the clonal selection principle. IEEE Transactions on Evolutionary Computation.
Induction of decision trees via evolutionary programming. Journal of Chemical Information and Computer Sciences.
Machine learning in medicine. Circulation.
Evolving Boolean functions with conjunctions and disjunctions via genetic programming.
What is machine learning?