Evolutionary-based selection of generalized instances for imbalanced classification

https://doi.org/10.1016/j.knosys.2011.01.012

Abstract

In supervised classification, we often encounter real-world problems in which the data do not have an equitable distribution among the different classes of the problem. In such cases, we are dealing with so-called imbalanced data sets. One of the most widely used techniques to deal with this problem consists of preprocessing the data prior to the learning process. This paper proposes a method belonging to the family of nested generalized exemplar learning, which accomplishes learning by storing objects in Euclidean n-space. Classification of new data is performed by computing their distance to the nearest generalized exemplar. The method is optimized by selecting the most suitable generalized exemplars by means of evolutionary algorithms. An experimental analysis is carried out over a wide range of highly imbalanced data sets, using the statistical tests suggested in the specialized literature. The results obtained show that our evolutionary proposal outperforms other classic and recent models in accuracy while requiring the storage of far fewer generalized examples.

Introduction

The class imbalance classification problem is one of the current challenges in data mining [49]. It appears when the number of instances of one class is much lower than the number of instances of the other class(es) [11]. Since standard learning algorithms are designed to minimize a global measure of error that is independent of the class distribution, they become biased towards the majority class during training, which results in low sensitivity to the minority class examples. Imbalance in class distribution is pervasive in a variety of real-world applications, including but not limited to telecommunications, the WWW, finance, biology and medicine [39].

A large number of approaches have been proposed to deal with this problem [32]. They can be categorized into two main groups: internal approaches, which create new algorithms or modify existing ones to take the class imbalance problem into consideration [41], [30], [19], and external approaches, which pre-process the data in order to diminish the effect caused by class imbalance [5], [10]. Imbalanced classification is also closely related to cost-sensitive classification [10], [50].

Exemplar-based learning was originally proposed by Medin and Schaffer [40] and revisited by Aha et al. [1]; it encompasses a family of methods widely used in machine learning and data mining [34]. A similar scheme for learning from examples is based on the Nested Generalized Exemplar (NGE) theory. Introduced by Salzberg [43], it makes several significant modifications to the exemplar-based learning model, the most important being that it retains the notion of storing verbatim examples in memory but also allows examples to be generalized. NGE methods are strongly related to the nearest neighbor classifier (NN) [13], which they were proposed to extend.

In NGE theory, generalizations take the form of hyperrectangles in a Euclidean n-space [35]. Several works argue for the benefits of using generalized instances together with single instances to form the classification rule [48], [16], [38]. With respect to instance-based classification [1], the use of generalizations improves the comprehensibility of the data stored to classify unseen examples and achieves a substantial compression of the data, reducing storage requirements. With respect to rule induction [23], [20], the ability to model decision surfaces as hybrids of distance-based methods (Voronoi diagrams) and parallel-axis separators can improve performance in domains with clusters of exemplars or exemplars strung out along a curve. In addition, NGE learning makes it possible to capture generalizations with exceptions.

A main process in data mining is data reduction [42]. In classification, it aims to reduce the size of the training set, mainly to increase the efficiency of the training phase (by removing redundant data) and even to reduce the classification error rate (by removing noisy data). Instance Selection (IS) is one of the best-known data reduction techniques in data mining [37].

The problem of yielding an optimal number of generalized examples for classifying a set of points is NP-hard. A large but finite set of candidates can easily be obtained by a simple heuristic algorithm acting over the training data, as in the sketch below. However, many of the generalized examples produced may be irrelevant and, as a result, the most influential ones must be distinguished. Evolutionary Algorithms (EAs) [17] have been used in data mining with promising results [22]. They have been successfully applied to descriptive [8] and predictive tasks [2], nearest neighbor classification [47], [46], feature selection [31], [44], [36], IS [7], [24], simultaneous instance and feature selection [15] and under-sampling for imbalanced learning [25], [29]. NGE is also directly related to clustering, for which EAs have been extensively used [33].
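To make this concrete, the following is a minimal sketch of one such heuristic: merging every training instance with its nearest neighbor of the same class to form an axis-parallel hyperrectangle. The function name and the exact merging rule are illustrative assumptions, not necessarily the procedure adopted in Section 4.3.

```python
import numpy as np

def candidate_hyperrectangles(X, y):
    """Build a pool of candidate generalized examples by merging each
    training instance with its nearest same-class neighbor. Each candidate
    is an axis-parallel hyperrectangle stored as (lower, upper, label)."""
    candidates = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        same = np.where(y == yi)[0]
        same = same[same != i]                 # exclude the instance itself
        if same.size == 0:                     # lone instance: keep the point
            candidates.append((xi.copy(), xi.copy(), yi))
            continue
        dists = np.linalg.norm(X[same] - xi, axis=1)
        xj = X[same[np.argmin(dists)]]
        candidates.append((np.minimum(xi, xj), np.maximum(xi, xj), yi))
    return candidates
```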

In this paper, we propose the use of EAs for the selection of generalized instances in imbalanced classification domains. Our objective is to increase the accuracy of this type of representation by selecting the most suitable set of generalized examples, enhancing classification performance over imbalanced domains. We compare our approach with the most representative models of NGE learning, BNGE [48], RISE [16] and INNER [38], and with two well-known rule induction methods, RIPPER [12] and PART [21].

We have selected a large collection of imbalanced data sets from the KEEL-dataset repository [3] to develop our experimental analysis. In order to deal with the problem of imbalanced data sets, we include a study involving the use of a preprocessing technique, the “Synthetic Minority Over-sampling Technique” (SMOTE) [9], to balance the distribution of training examples over both classes. The empirical study has been validated via non-parametric statistical tests [14], [28], [27], and the results show an improvement in accuracy for our approach, whereas the number of generalized examples stored in the final subset is much lower.

The rest of this paper is organized as follows. Section 2 gives an explanation of NGE learning. In Section 3, we introduce some issues of imbalanced classification: the SMOTE pre-processing technique and the evaluation metric used in this scenario. Section 4 explains all the topics concerning the approach proposed to tackle the imbalanced classification problem. Sections 5 and 6 describe the experimental framework used and the analysis of results, respectively. Finally, in Section 7, we point out the conclusions reached.


NGE learning

NGE is a learning paradigm based on class exemplars, where an induced hypothesis has the graphical shape of a set of hyperrectangles in an M-dimensional Euclidean space. Exemplars of classes are either hyperrectangles or single instances [43]. The input of an NGE system is a set of training examples, each described as a vector of attribute/value pairs plus an associated class. Attributes can be either numerical or categorical. Numerical attributes are usually normalized to the [0, 1] interval.
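As an illustration of the matching rule implied by this representation, the sketch below computes the distance from a query point to a hyperrectangle (zero inside it, distance to the nearest face outside) and classifies by the nearest generalized exemplar, as described in the abstract. The names and the tie handling are illustrative; the original model [43] additionally weights exemplars and attributes.

```python
import numpy as np

def rect_distance(q, lower, upper):
    """Distance from query q to an axis-parallel hyperrectangle: zero if q
    lies inside; otherwise the Euclidean distance to the closest face."""
    below = np.maximum(lower - q, 0.0)   # per-attribute undershoot
    above = np.maximum(q - upper, 0.0)   # per-attribute overshoot
    return np.sqrt(np.sum((below + above) ** 2))

def nge_classify(q, exemplars):
    """Assign the class of the nearest generalized exemplar. `exemplars`
    holds (lower, upper, label) triples; a stored single instance is the
    degenerate case lower == upper."""
    dists = [rect_distance(q, lo, up) for lo, up, _ in exemplars]
    return exemplars[int(np.argmin(dists))][2]
```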

Imbalanced data sets in classification

In this section, we address some important issues related to imbalanced classification, describing the pre-processing technique applied to deal with the imbalance problem: the SMOTE algorithm [9]. We also present the evaluation metric mainly used for this type of classification problem.
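For reference, a bare-bones rendering of the SMOTE interpolation step follows, assuming purely numerical attributes; the full algorithm [9] also handles nominal features and parameterizes the amount of over-sampling.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples: pick a minority seed,
    choose one of its k nearest minority neighbors, and interpolate at a
    random point along the segment joining them."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the seed itself
        j = rng.choice(neighbors)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```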

Evolutionary selection of generalized examples for imbalanced classification

The approach proposed in this paper, named Evolutionary Generalized Instance Selection by CHC (EGIS-CHC), is fully explained in this section. Firstly, we introduce the CHC model used as an EA to perform generalized instance selection in Section 4.1. Secondly, the specific issues regarding representation and fitness function are specified in Section 4.2. Finally, Section 4.3 describes the process for generating the initial set of generalized examples.
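A compact sketch of a CHC search for this selection task is given below: binary chromosomes mark which candidate generalized examples are kept, HUX crossover exchanges half of the differing bits, incest prevention blocks the mating of similar parents, and a cataclysmic restart reseeds the population from the best individual. The fitness function passed in is a placeholder; the actual objective combining classification performance and reduction is the one specified in Section 4.2.

```python
import numpy as np

def hux(a, b, rng):
    """HUX crossover: exchange exactly half of the non-matching bits."""
    diff = np.where(a != b)[0]
    swap = rng.choice(diff, size=len(diff) // 2, replace=False)
    c1, c2 = a.copy(), b.copy()
    c1[swap], c2[swap] = b[swap], a[swap]
    return c1, c2

def chc_select(n_bits, fitness, pop_size=50, max_evals=10000, seed=0):
    """CHC over binary chromosomes; bit i == 1 keeps candidate i. `fitness`
    maps a 0/1 mask to a score to maximize (placeholder for Section 4.2)."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    fit = np.array([fitness(c) for c in pop])
    d = n_bits // 4                               # incest-prevention threshold
    evals = pop_size
    while evals < max_evals:
        idx = rng.permutation(pop_size)
        children = []
        for i in range(0, pop_size - 1, 2):
            a, b = pop[idx[i]], pop[idx[i + 1]]
            if np.count_nonzero(a != b) / 2 > d:  # mate distant parents only
                children.extend(hux(a, b, rng))
        child_survived = False
        if children:
            cf = np.array([fitness(c) for c in children])
            evals += len(children)
            merged = np.vstack([pop] + children)
            mf = np.concatenate([fit, cf])
            order = np.argsort(mf)[-pop_size:]    # elitist survival
            child_survived = bool((order >= pop_size).any())
            pop, fit = merged[order], mf[order]
        if not child_survived:
            d -= 1                                # population is converging
        if d < 0:                                 # cataclysmic restart
            best = pop[np.argmax(fit)].copy()
            flips = rng.random((pop_size, n_bits)) < 0.35
            pop = np.where(flips, 1 - best, best)
            pop[0] = best                         # preserve the elite
            fit = np.array([fitness(c) for c in pop])
            evals += pop_size
            d = n_bits // 4
    return pop[np.argmax(fit)]
```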

Experimental framework

This section describes the methodology followed in the experimental study of the learning approaches based on generalized examples. We explain the configuration of the experiment: the imbalanced data sets used and the parameters of the algorithms.

Results and analysis

In this section, we carry out a complete experimental analysis addressing three important issues:

  • First, the performance of the algorithms when they are applied over the original data sets (Section 6.1).

  • Second, a comparison of applying EGIS-CHC with and without SMOTE, together with the performance of the algorithms when they are applied over SMOTE-preprocessed data sets (Section 6.2).

  • Finally, an analysis of the complexity of the models obtained, by means of computing the number of generalized examples stored (Section 6.3).

Concluding remarks

The purpose of this paper is to present EGIS-CHC, an evolutionary model to improve imbalanced classification based on nested generalized exemplar learning. The proposal performs an optimized selection over a set of previously defined generalized examples obtained by a simple and fast heuristic.

The results show that generalized exemplar selection based on evolutionary algorithms obtains promising results, optimizing performance in imbalanced domains. It was compared with classic and recent models of NGE learning and with well-known rule induction methods.

Acknowledgement

This work was supported by TIN2008-06681-C06-01 and TIN2008-06681-C06-02. J. Derrac holds a research scholarship from the University of Granada.

References

  • M.A. Mazurowski et al., Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance, Neural Networks (2008).

  • S. Senthamarai Kannan et al., A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm, Knowledge-Based Systems (2010).

  • I. Triguero et al., Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification, Pattern Recognition (2011).

  • S. Zhang, Cost-sensitive classification with respect to waiting cost, Knowledge-Based Systems (2010).

  • D.W. Aha et al., Instance-based learning algorithms, Machine Learning (1991).

  • J. Alcalá-Fdez et al., KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing (2011).

  • J. Alcalá-Fdez et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing (2009).

  • G.E.A.P.A. Batista et al., A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations (2004).

  • J.R. Cano et al., Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation (2003).

  • N. Chawla et al., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002).

  • N. Chawla et al., Automatically countering imbalance and its empirical relationship to cost, Data Mining and Knowledge Discovery (2008).

  • N.V. Chawla et al., Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter (2004).

  • W.W. Cohen, Fast effective rule induction, in: Proceedings of the Twelfth International Conference on Machine Learning, …

  • T.M. Cover et al., Nearest neighbor pattern classification, IEEE Transactions on Information Theory (1967).

  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006).