Evolutionary-based selection of generalized instances for imbalanced classification
Introduction
The class imbalance classification problem is one of the current challenges in data mining [49]. It appears when the number of instances of one class is much lower than the number of instances of the other class(es) [11]. Since standard learning algorithms are designed to minimize a global measure of error that is independent of the class distribution, they become biased towards the majority class during training, which results in lower sensitivity when detecting minority class examples. Imbalance in class distribution is pervasive in a variety of real-world applications, including but not limited to telecommunications, the WWW, finance, biology and medicine [39].
A large number of approaches have been proposed to deal with this problem [32]. They can be categorized into two main groups: internal approaches, which create new algorithms or modify existing ones to take the class imbalance problem into consideration [41], [30], [19], and external approaches, which pre-process the data in order to diminish the effect of class imbalance [5], [10]. Imbalanced classification is also closely related to cost-sensitive classification [10], [50].
Exemplar-based learning was originally proposed by Medin and Schaffer [40], revisited by Aha et al. [1], and comprises a set of methods widely used in machine learning and data mining [34]. A similar scheme for learning from examples is based on the Nested Generalized Exemplar (NGE) theory. Introduced by Salzberg [43], it makes several significant modifications to the exemplar-based learning model. The most important one is that it retains the notion of storing verbatim examples in memory but also allows examples to be generalized. NGE methods are strongly related to the nearest neighbor classifier (NN) [13], which they were proposed to extend.
In NGE theory, generalizations take the form of hyperrectangles in a Euclidean n-space [35]. Several works argue for the benefits of using generalized instances together with single instances to form the classification rule [48], [16], [38]. With respect to instance-based classification [1], the use of generalizations improves the comprehensibility of the data stored to classify unseen examples and achieves a substantial compression of the data, reducing storage requirements. With respect to rule induction [23], [20], the ability to model decision surfaces by hybridizing distance-based methods (Voronoi diagrams) and axis-parallel separators can improve performance in domains with clusters of exemplars or exemplars strung out along a curve. In addition, NGE learning makes it possible to capture generalizations with exceptions.
A central process in data mining is data reduction [42]. In classification, it aims to reduce the size of the training set, mainly to increase the efficiency of the training phase (by removing redundant data) and even to reduce the classification error rate (by removing noisy data). Instance Selection (IS) is one of the best known data reduction techniques in data mining [37].
The problem of yielding an optimal number of generalized examples for classifying a set of points is NP-hard. A large but finite subset of them can easily be obtained with a simple heuristic algorithm acting over the training data. However, many of the generalized examples produced may be irrelevant and, as a result, the most influential ones must be identified. Evolutionary Algorithms (EAs) [17] have been used in data mining with promising results [22]. They have been successfully applied to descriptive [8] and predictive tasks [2], nearest neighbor classification [47], [46], feature selection [31], [44], [36], IS [7], [24], simultaneous instance and feature selection [15] and under-sampling for imbalanced learning [25], [29]. NGE is also directly related to clustering, for which EAs have been extensively used [33].
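As an illustration of the kind of simple heuristic mentioned above, the sketch below greedily merges same-class instances into axis-parallel hyperrectangles, rejecting any expansion that would swallow a point of another class. This is a hypothetical toy procedure in the spirit of greedy NGE generalization (e.g. BNGE-style merging), not the exact heuristic used later in the paper:

```python
import numpy as np

def greedy_hyperrectangles(X, y):
    """Greedily merge same-class instances into axis-parallel boxes.

    Returns a list of (lower, upper, label). A box absorbs its
    nearest unmerged same-class point only if the expanded box
    would contain no point of any other class.
    """
    unused = list(range(len(X)))
    boxes = []
    while unused:
        i = unused.pop(0)
        lower, upper, label = X[i].copy(), X[i].copy(), y[i]
        merged = True
        while merged:
            merged = False
            # Candidate points: unmerged, same class, nearest to the box center first
            same = [j for j in unused if y[j] == label]
            same.sort(key=lambda j: np.linalg.norm(X[j] - (lower + upper) / 2))
            for j in same:
                lo, up = np.minimum(lower, X[j]), np.maximum(upper, X[j])
                # Reject the merge if any other-class point falls inside the new box
                inside = [k for k in range(len(X)) if y[k] != label
                          and np.all(X[k] >= lo) and np.all(X[k] <= up)]
                if not inside:
                    lower, upper = lo, up
                    unused.remove(j)
                    merged = True
                    break
        boxes.append((lower, upper, label))
    return boxes

X = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9]])
y = np.array([0, 0, 1, 1])
boxes = greedy_hyperrectangles(X, y)
print(len(boxes))  # two boxes, one per class cluster
```

Such a heuristic deliberately over-generates candidates; the selection of the influential ones is the optimization problem addressed later by the evolutionary algorithm.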
In this paper, we propose the use of EAs for the selection of generalized instances in imbalanced classification domains. Our objective is to increase the accuracy of this type of representation by selecting the most suitable set of generalized examples, thus enhancing its classification performance over imbalanced domains. We compare our approach with the most representative models of NGE learning: BNGE [48], RISE [16] and INNER [38], and with two well-known rule induction methods: RIPPER [12] and PART [21].
We have selected a large collection of imbalanced data sets from the KEEL-dataset repository [3] for our experimental analysis. In order to deal with the problem of imbalanced data sets, we include a study involving a pre-processing technique, the Synthetic Minority Over-sampling Technique (SMOTE) [9], to balance the distribution of training examples across both classes. The empirical study has been validated via non-parametric statistical tests [14], [28], [27], and the results show an improvement in accuracy for our approach, while the number of generalized examples stored in the final subset is much lower.
The rest of this paper is organized as follows: Section 2 gives an explanation of NGE learning. In Section 3, we introduce some issues of imbalanced classification: the SMOTE pre-processing technique and the evaluation metric used for this scenario. Section 4 explains all topics concerning the approach proposed to tackle the imbalanced classification problem. Sections 5 and 6 describe the experimental framework used and the analysis of results, respectively. Finally, in Section 7, we point out the conclusions reached.
NGE learning
NGE is a learning paradigm based on class exemplars, where an induced hypothesis has the graphical shape of a set of hyperrectangles in an M-dimensional Euclidean space. Exemplars of classes are either hyperrectangles or single instances [43]. The input of an NGE system is a set of training examples, each described as a vector of attribute/value pairs together with an associated class. Attributes can be either numerical or categorical. Numerical attributes are usually normalized in the [0, 1] interval.
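To make the NGE classification rule concrete, the following sketch classifies a query point by the nearest exemplar, where single instances are degenerate hyperrectangles. This is an illustrative rendition rather than any specific published system; the zero-inside, distance-to-nearest-face measure is the commonly described convention for normalized numerical attributes:

```python
import numpy as np

def hyperrectangle_distance(x, lower, upper):
    """Distance from point x to the hyperrectangle [lower, upper].

    Zero when x falls inside the box; otherwise the Euclidean
    distance to the nearest face, computed per attribute.
    """
    # Per-attribute overshoot: how far x lies outside [lower, upper]
    delta = np.maximum(lower - x, 0) + np.maximum(x - upper, 0)
    return np.sqrt(np.sum(delta ** 2))

def classify(x, exemplars):
    """Assign x the class of the nearest exemplar.

    Each exemplar is (lower, upper, label); a stored verbatim
    instance is simply a degenerate box with lower == upper.
    """
    best = min(exemplars, key=lambda e: hyperrectangle_distance(x, e[0], e[1]))
    return best[2]

# One verbatim instance (degenerate box) and one generalized exemplar
exemplars = [
    (np.array([0.1, 0.1]), np.array([0.1, 0.1]), "minority"),
    (np.array([0.4, 0.4]), np.array([0.9, 0.9]), "majority"),
]
print(classify(np.array([0.5, 0.5]), exemplars))  # inside the box -> "majority"
```

Points falling inside a hyperrectangle match it at distance zero, which is what lets NGE represent generalizations (and, with nested boxes, exceptions to them).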
Imbalanced data sets in classification
In this section, we address some important issues related to imbalanced classification by describing the pre-processing technique applied to deal with the imbalance problem: the SMOTE algorithm [9]. We also present the evaluation metric most commonly used for this type of classification problem.
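As a rough sketch of the idea behind SMOTE (a minimal illustration following the usual description of the technique, not a reference implementation; the neighbor count k and the seed handling are assumptions), each synthetic example interpolates between a minority instance and one of its nearest minority-class neighbors:

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority examples by interpolation.

    minority: array of shape (n, d) with minority-class points.
    Each synthetic point lies on the segment between a randomly
    chosen minority instance and one of its k nearest minority
    neighbors.
    """
    rng = np.random.default_rng(rng)
    k_eff = min(k, len(minority) - 1)  # cannot use more neighbors than points available
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        neighbors = np.argsort(d)[:k_eff]
        j = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
new_points = smote_sketch(minority, n_synthetic=4, rng=0)
print(new_points.shape)  # (4, 2)
```

Because every synthetic point is a convex combination of two minority instances, over-sampling enlarges the minority region without simply duplicating examples.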
Evolutionary selection of generalized examples for imbalanced classification
The approach proposed in this paper, named Evolutionary Generalized Instance Selection by CHC (EGIS-CHC), is fully explained in this section. Firstly, we introduce the CHC model used as an EA to perform generalized instance selection in Section 4.1. Secondly, the specific issues regarding representation and fitness function are specified in Section 4.2. Finally, Section 4.3 describes the process for generating the initial set of generalized examples.
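The selection step itself can be sketched with a binary chromosome holding one bit per candidate generalized example. The fragment below is deliberately simplified: it uses a plain elitist generational search with single-bit mutation rather than the full CHC machinery (HUX crossover, incest prevention, restarts), and a toy fitness mixing training accuracy and reduction rate with an assumed weight alpha:

```python
import numpy as np

def fitness(mask, boxes, X, y, alpha=0.5):
    """Score a subset of generalized examples: accuracy vs. reduction."""
    chosen = [b for b, m in zip(boxes, mask) if m]
    if not chosen:
        return 0.0
    correct = 0
    for x, label in zip(X, y):
        # Nearest-box rule: zero distance inside, distance to faces outside
        def dist(b):
            delta = np.maximum(b[0] - x, 0) + np.maximum(x - b[1], 0)
            return np.sqrt((delta ** 2).sum())
        correct += min(chosen, key=dist)[2] == label
    accuracy = correct / len(X)
    reduction = 1 - sum(mask) / len(mask)
    return alpha * accuracy + (1 - alpha) * reduction

def evolve(boxes, X, y, pop=20, gens=30, rng=None):
    """Tiny elitist generational search over binary selection masks."""
    rng = np.random.default_rng(rng)
    population = rng.integers(0, 2, size=(pop, len(boxes)))
    for _ in range(gens):
        scores = np.array([fitness(m, boxes, X, y) for m in population])
        elite = population[np.argsort(scores)[::-1][: pop // 2]]
        # Offspring: copy an elite parent and flip one random bit
        children = elite.copy()
        flips = rng.integers(len(boxes), size=len(children))
        children[np.arange(len(children)), flips] ^= 1
        population = np.vstack([elite, children])
    scores = np.array([fitness(m, boxes, X, y) for m in population])
    return population[scores.argmax()]

boxes = [(np.array([0.0, 0.0]), np.array([0.3, 0.3]), 0),
         (np.array([0.7, 0.7]), np.array([1.0, 1.0]), 1)]
X = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9]])
y = np.array([0, 0, 1, 1])
best = evolve(boxes, X, y, rng=0)
```

In the actual proposal the fitness would be evaluated with the imbalance-aware metric discussed in Section 3 rather than plain accuracy, which is used here only to keep the sketch short.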
Experimental framework
This section describes the methodology followed in the experimental study of the generalized examples based learning approaches. We will explain the configuration of the experiment: imbalanced data sets used and parameters for the algorithms.
Results and analysis
In this section we will carry out a complete experimental analysis in order to show three important issues:
- First, the performance of the algorithms when they are applied over the original data sets (Section 6.1).
- Second, the comparison of using SMOTE or not prior to EGIS-CHC, and the performance of the algorithms when they are applied over SMOTE-processed data sets (Section 6.2).
- Then, the analysis of the complexity of the models obtained, by means of counting the number of generalized examples.
Concluding remarks
The purpose of this paper is to present EGIS-CHC, an evolutionary model to improve imbalanced classification based on nested generalized example learning. The proposal performs an optimized selection over previously defined generalized examples obtained by a simple and fast heuristic.
The results show that generalized exemplar selection based on evolutionary algorithms can obtain promising results, optimizing performance in imbalanced domains. It was compared with classical proposals of NGE learning and with rule induction methods.
Acknowledgement
This work was supported by TIN2008-06681-C06-01 and TIN2008-06681-C06-02. J. Derrac holds a research scholarship from the University of Granada.
References (50)
- Differential evolution for learning the classification method PROAFTN, Knowledge-Based Systems, 2010
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 1997
- Mining associative classification rules with stock trading data – a GA-based method, Knowledge-Based Systems, 2010
- IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule, Pattern Recognition, 2010
- On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets, Information Sciences, 2010
- A memetic algorithm for evolutionary prototype selection: a scaling up approach, Pattern Recognition, 2008
- Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems, Applied Soft Computing, 2009
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences, 2010
- Local distance-based classification, Knowledge-Based Systems, 2008
- An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine, Knowledge-Based Systems, 2011
- Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance, Neural Networks
- A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm, Knowledge-Based Systems
- Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification, Pattern Recognition
- Cost-sensitive classification with respect to waiting cost, Knowledge-Based Systems
- Instance-based learning algorithms, Machine Learning
- KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing
- KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing
- A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations
- Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation
- SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- Automatically countering imbalance and its empirical relationship to cost, Data Mining and Knowledge Discovery
- Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter
- Nearest neighbor pattern classification, IEEE Transactions on Information Theory
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research