Addressing imbalanced classification with instance generation techniques: IPADE-ID

doi:10.1016/j.neucom.2013.01.050

Neurocomputing

Volume 126, 27 February 2014, Pages 15-28

https://doi.org/10.1016/j.neucom.2013.01.050

Abstract

A large number of real-world applications present a class distribution in which the examples of one class heavily outnumber those of the other. In this situation, standard classification techniques usually lose performance and struggle to identify the minority class, which is precisely the class of interest in these applications.

In this work, we propose the Iterative Instance Adjustment for Imbalanced Domains (IPADE-ID) algorithm. It is an evolutionary framework that uses an instance generation technique to counter the existing imbalance by modifying the original training set. The method iteratively learns the appropriate number of examples to represent each class, together with their positioning. The learning process comprises three key operations: a customized initialization procedure, an evolutionary optimization of the positioning of the examples, and a selection of the most representative examples for each class.

We carry out an experimental analysis on a wide range of highly imbalanced datasets, comparing the proposal with recognized solutions to the problem. The results, contrasted through non-parametric statistical tests, show that our proposal outperforms previously proposed methods.

Introduction

Classification with imbalanced datasets is a challenging data mining problem that has attracted a lot of attention in recent years [1], [2]. The problem is extremely important because it arises in many real-world data mining applications including, but not limited to, medical diagnosis, fraud detection, finance and network intrusion detection. In these applications, the samples of one class are greatly outnumbered by the samples of the other class. Usually, the minority class is the most interesting class from the learning point of view, and errors on it carry a higher cost [3], [4].

Imbalanced datasets pose a significant difficulty for most classifiers, which assume a nearly balanced class distribution [5]. Standard classifiers are developed to minimize a global measure of error that is independent of the class distribution; this biases them towards the majority class and leads them to pay less attention to the minority class. Consequently, classifying the minority class is more error prone than classifying the majority class, as a large share of the errors is concentrated in the minority class [6]. Furthermore, the examples of the minority class can be treated as noise and might be completely ignored by the classifier.
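The bias described above can be illustrated with a small sketch. The dataset (95 majority and 5 minority examples) and the always-majority classifier are hypothetical and chosen only for illustration; they are not taken from the experiments of the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples classified correctly (a global measure)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of class `label` that are classified correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


# Hypothetical dataset: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)                      # high global accuracy
minority_recall = recall(y_true, y_pred, label=1)   # no minority example found
majority_recall = recall(y_true, y_pred, label=0)   # every majority example found
```

The global accuracy looks good although no minority example is recognized, which is why a measure that accounts for each class separately is preferable in this setting.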

Numerous approaches have been suggested to tackle the problem of classification with imbalanced datasets [1], [2], [7]. These approaches are developed at both the data and the algorithm level. Solutions at the algorithm level modify existing learning algorithms so that their operation focuses on improving the learning of the minority class [8], [9]. Solutions at the data level, also known as data sampling, modify the original class distribution in order to obtain a more or less balanced dataset that standard classifiers can use to correctly identify each class [10], [11], [12].
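As a minimal illustration of a data-level solution, the sketch below balances a dataset by random oversampling, that is, by duplicating randomly chosen examples of the smaller classes. This is a generic baseline shown only to make the idea concrete; it is not the method proposed in this paper, and the function name and toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(examples, seed=0):
    """Duplicate random examples of the smaller classes until all classes are equal in size.

    `examples` is a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in examples:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)  # keep every original example
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Hypothetical toy data: six majority examples (0) and two minority examples (1).
examples = [((float(i), 0.0), 0) for i in range(6)] + [((9.0, 9.0), 1), ((8.0, 9.0), 1)]
balanced = random_oversample(examples)
counts = Counter(label for _, label in balanced)
```

After the call, both classes have six examples; the minority class is represented only by copies of its two original examples, so no new information is created.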

Instance reduction methods [13], which were originally designed for other preprocessing purposes (speeding up learning methods, improving their noise tolerance and reducing their storage requirements [14]), can also be applied to imbalanced datasets [15], [16] as a data-level solution that seeks a balance between the minority and the majority classes. To achieve high performance, instance reduction methods must adapt their bias to this situation.

An instance reduction process is devoted to finding the best reduced set that represents the original training data with a smaller number of instances. Depending on how the reduced set is created, this methodology can be divided into Instance Selection (IS) [13], [17], [18] and Instance Generation (IG) [19], [20]. The former attempts to choose an appropriate subset of the original training data, while the latter can also build new artificial instances to better adjust the decision boundaries of the classes. In this manner, the IG process fills regions of the problem domain that have no representative examples in the original dataset. IS methods have been applied to imbalanced datasets with promising results [15], [16], [21]; however, to the best of our knowledge, IG techniques have not yet been used to deal with imbalanced classification problems.
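The difference between IS and IG can be sketched as follows. Both functions are deliberately simple stand-ins (random selection and class centroids) chosen for illustration; they are assumptions and do not correspond to any specific method from the cited literature.

```python
import math
import random


def nearest_label(point, prototypes):
    """1-NN rule: return the label of the prototype closest to `point`."""
    return min(prototypes, key=lambda proto: math.dist(point, proto[0]))[1]


def select_instances(training, per_class, seed=0):
    """Instance Selection: keep a subset of the ORIGINAL instances of each class."""
    rng = random.Random(seed)
    reduced = []
    for label in sorted({lab for _, lab in training}):
        members = [inst for inst in training if inst[1] == label]
        reduced.extend(rng.sample(members, min(per_class, len(members))))
    return reduced


def generate_instances(training):
    """Instance Generation: build NEW artificial instances (here, one centroid per class)."""
    reduced = []
    for label in sorted({lab for _, lab in training}):
        points = [p for p, lab in training if lab == label]
        centroid = tuple(sum(coord) / len(points) for coord in zip(*points))
        reduced.append((centroid, label))
    return reduced


# Hypothetical toy data: (features, label) pairs.
training = [((0.0, 0.0), 0), ((0.0, 2.0), 0), ((2.0, 0.0), 0), ((2.0, 2.0), 0),
            ((5.0, 5.0), 1), ((7.0, 5.0), 1)]
selected = select_instances(training, per_class=1)   # members of `training`
generated = generate_instances(training)             # points not present in `training`
predicted = nearest_label((6.5, 5.0), generated)
```

Every element of `selected` is an original training instance, whereas the centroids in `generated` occupy positions where no original instance lies, which is the property that lets IG fill unrepresented regions.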

Following the idea of IG techniques, we propose the Iterative Instance Adjustment for Imbalanced Domains (IPADE-ID) algorithm to deal with highly imbalanced datasets. IPADE-ID is inspired by the IG technique IPADE [22], [23], which tries to obtain an adequate synthetic training set from the original training set, following an incremental approach to determine the most appropriate number of instances per class. The proposal is based on three fundamental operations: a customized initialization procedure, an evolutionary adjustment of the prototypes, and the selection of the most representative examples to define the classes. The initialization procedure should suit the specific learning algorithm used with IPADE-ID.

In this work, we choose the Nearest Neighbor (NN) rule [24] and the C4.5 algorithm [25] as learning methods, and we provide initialization procedures for IPADE-ID that match these learning approaches. At each step, an optimization procedure based on an adaptive differential evolution algorithm [26], [27], [28] adjusts the positioning of the instances generated so far, and a selection procedure adds new instances if needed. This selection procedure has been specifically designed for the imbalanced scenario and focuses on the performance of the minority class. This informed and organized combination of techniques leads to a hybrid artificial intelligence system [29], [30] that is able to cope with imbalanced datasets.
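To make the idea of adjusting prototype positions with differential evolution concrete, the following sketch runs a plain DE/rand/1/bin scheme whose fitness is the 1-NN accuracy on the training data. It is a generic illustration under several assumptions: the scale factor `f` and crossover rate `cr` are fixed constants rather than adapted as in the cited adaptive variant, the population encoding (one list of prototype positions per individual) is our own choice, and plain accuracy is used as fitness instead of a measure tailored to the minority class.

```python
import math
import random


def nn_accuracy(prototypes, labels, data):
    """Fitness: fraction of `data` labeled correctly by the 1-NN rule over the prototypes."""
    correct = 0
    for point, target in data:
        nearest = min(range(len(prototypes)), key=lambda i: math.dist(point, prototypes[i]))
        correct += labels[nearest] == target
    return correct / len(data)


def de_generation(population, labels, data, rng, f=0.5, cr=0.9):
    """One DE/rand/1/bin generation with greedy replacement.

    Each individual is a list of prototype positions; `labels` stays fixed.
    `f` and `cr` are hypothetical constants, not self-adapted values.
    """
    next_population = []
    for idx, target in enumerate(population):
        others = [ind for j, ind in enumerate(population) if j != idx]
        a, b, c = rng.sample(others, 3)
        trial = []
        for p, proto in enumerate(target):
            forced = rng.randrange(len(proto))  # one coordinate per prototype always mutates
            trial.append(tuple(
                a[p][d] + f * (b[p][d] - c[p][d])
                if (rng.random() < cr or d == forced) else proto[d]
                for d in range(len(proto))
            ))
        # Greedy selection: keep the trial only if it is at least as fit as the target.
        if nn_accuracy(trial, labels, data) >= nn_accuracy(target, labels, data):
            next_population.append(trial)
        else:
            next_population.append(target)
    return next_population


# Hypothetical toy problem: three majority points (0) and one minority point (1).
rng = random.Random(0)
data = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0), ((4.0, 4.0), 1)]
labels = [0, 1]  # one prototype per class
population = [[(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in labels] for _ in range(6)]
before = max(nn_accuracy(ind, labels, data) for ind in population)
for _ in range(20):
    population = de_generation(population, labels, data, rng)
after = max(nn_accuracy(ind, labels, data) for ind in population)
```

Because a trial replaces its target only when it is at least as fit, the best fitness in the population can never decrease from one generation to the next.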

In order to analyze the performance of the proposal, we focus on highly imbalanced binary classification problems, selecting a benchmark of 44 problems from the KEEL dataset repository [31]. Our experimental analysis measures the performance of the models with the Area Under the ROC Curve (AUC) [32]. The study is carried out using non-parametric statistical tests to check whether there are significant differences among the results [33], [34].
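The AUC can be computed in several ways; the sketch below shows two common ones. The first uses the rank interpretation of the AUC (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half). The second is the single-point form often used for classifiers that output only crisp labels. Which variant is used in the experiments is not stated in this section, so both functions and the example values are assumptions for illustration.

```python
def auc_from_scores(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_from_rates(tp_rate, fp_rate):
    """Area under the ROC polyline through (0, 0), (fp_rate, tp_rate) and (1, 1)."""
    return (1.0 + tp_rate - fp_rate) / 2.0


# Hypothetical scores for two positive and three negative examples.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
ranked_auc = auc_from_scores(scores, labels)

# A classifier that always predicts the majority (negative) class:
# it finds no positives (tp_rate = 0) and raises no false alarms (fp_rate = 0).
trivial_auc = auc_from_rates(tp_rate=0.0, fp_rate=0.0)
```

Unlike accuracy, the always-majority classifier obtains only 0.5 here, the same value as random guessing, regardless of how skewed the class distribution is.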

The rest of the paper is organized as follows. Section 2 gives some background on classification with imbalanced datasets and instance generation techniques. Section 3 introduces the proposed approach. Sections 4 and 5 describe the experimental framework and the analysis of the results, respectively. Finally, Section 6 presents the conclusions of this work.

Section snippets

Background

The purpose of this section is to provide the background information needed to describe our proposal. It is divided into two parts: a description of instance generation techniques (Section 2.1) and an introduction to the problem of classification with imbalanced datasets (Section 2.2).

Iterative instance adjustment for imbalanced domains: IPADE-ID

In this section, we present and describe the proposed approach in depth, denoted as IPADE for Imbalanced Domains (IPADE-ID). IPADE-ID is influenced by the IG algorithm IPADE, having some features in common with it like its iterative way of working or the usage of adaptive evolutionary techniques to optimize the instances generated up to now. Nevertheless, IPADE-ID features several differences from its predecessor: IPADE-ID presents a new initialization of the prototypes procedure, specifically

Experimental framework

In this section, we present the setup of the experimental framework used to analyze our proposal. We describe the algorithms selected for the comparison together with their configuration parameters and the imbalanced datasets selected, and we explain why statistical tests are needed.

Experimental results and analysis

In this section, we present the empirical analysis of the proposed IPADE-ID algorithm in order to determine its robustness in a scenario of highly imbalanced datasets. We divide the study in several parts: a first one devoted to the results of IPADE-ID using the NN rule in its way of working (Section 5.1), and a second part with the results of the proposal using the C4.5 decision tree as classifier (Section 5.2). Finally, a study on the impact of the data modification that some of the

Concluding remarks

In this paper, we have presented IPADE-ID, a new approach to deal with the problem of classification with highly imbalanced datasets. The proposal modifies the training set using an IG technique based on differential evolution, adapting its way of working to the imbalanced scenario. As learning methods, we have selected the NN rule and the C4.5 decision tree, and we have adapted the IPADE-ID approach to the behavior of these methods.

The

Acknowledgments

This work was partially supported by the Spanish Ministry of Science and Technology under project TIN2011-28488 and the Andalusian Research Plans P11-TIC-7765 and P10-TIC-6858. V. López holds an FPU scholarship from the Spanish Ministry of Education.

Victoria López received her M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. She is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. Her research interests include data mining, classification in imbalanced domains, fuzzy rule learning and evolutionary algorithms.

References (71)

V. López et al.
Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics
Expert Systems with Applications
(2012)
T. Yu et al.
VQSVM: a case study for incorporating prior domain knowledge into inductive machine learning
Neurocomputing
(2010)
S.-H. Oh
Error back-propagation algorithm for classification of imbalanced data
Neurocomputing
(2011)
S. García et al.
Evolutionary-based selection of generalized instances for imbalanced classification
Knowledge-Based Systems
(2012)
S. García et al.
Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems
Applied Soft Computing
(2009)
J. Derrac et al.
IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule
Pattern Recognition
(2010)
H.A. Fayed et al.
Self-generating prototypes for pattern classification
Pattern Recognition
(2007)
E. Corchado et al.
Hybrid intelligent algorithms and applications
Information Sciences
(2010)
E. Corchado et al.
New trends and applications on hybrid artificial intelligence systems
Neurocomputing
(2012)
J.S. Sánchez et al.
Analysis of new techniques to obtain quality training sets
Pattern Recognition Letters
(2003)

J.S. Sánchez

High training set size reduction by space partitioning and prototype abstraction

Pattern Recognition

(2004)

I. Triguero et al.

Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification

Pattern Recognition

(2011)

I. Brown et al.

An experimental comparison of classification algorithms for imbalanced credit scoring data sets

Expert Systems with Applications

(2012)

J. Xiao et al.

Dynamic classifier ensemble model for customer classification with imbalanced class distribution

Expert Systems with Applications

(2012)

W. Khreich et al.

Iterative Boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs

Pattern Recognition

(2010)

N. García-Pedrajas et al.

Class imbalance methods for translation initiation site recognition in DNA sequences

Knowledge-Based Systems

(2012)

J.G. Moreno-Torres et al.

A unifying view on dataset shift in classification

Pattern Recognition

(2012)

A.P. Bradley

The use of the area under the ROC curve in the evaluation of machine learning algorithms

Pattern Recognition

(1997)

M. Lozano et al.

Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces

Pattern Recognition

(2006)

R. Barandela et al.

Strategies for learning in class imbalance problems

Pattern Recognition

(2003)

Q. Gao et al.

Center-based nearest neighbor classifier

Pattern Recognition

(2007)

J. Wang et al.

Improving nearest neighbor rule with a simple adaptative distance measure

Pattern Recognition Letters

(2007)

Y. Sun et al.

Classification of imbalanced data: a review

International Journal of Pattern Recognition and Artificial Intelligence

(2009)

H. He et al.

Learning from imbalanced data

IEEE Transactions on Knowledge and Data Engineering

(2009)

C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th IEEE International Joint Conference...

B. Zadrozny, J. Langford, N. Abe, Cost-sensitive learning by cost-proportionate example weighting, in: Proceedings of...

G.M. Weiss

Mining with rarity: a unifying framework

SIGKDD Explorations

(2004)

N. Japkowicz et al.

The class imbalance problem: a systematic study

Intelligent Data Analysis Journal

(2002)

N.V. Chawla et al.

SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research

(2002)

G.E.A.P.A. Batista et al.

A study of the behaviour of several methods for balancing machine learning training data

SIGKDD Explorations

(2004)

D.R. Wilson et al.

Reduction techniques for instance-based learning algorithms

Machine Learning

(2000)

I. Kononenko et al.

Machine Learning and Data Mining: Introduction to Principles and Algorithms

(2007)

A. de Haro-Garcia, N. Garcia-Pedrajas, A scalable method for instance selection for class-imbalance datasets, in:...

S. García et al.

Prototype selection for nearest neighbor classification: taxonomy and empirical study

IEEE Transactions on Pattern Analysis and Machine Intelligence

(2012)

I. Triguero et al.

A taxonomy and experimental study on prototype generation for nearest neighbor classification

IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews

(2012)


Isaac Triguero received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, semisupervised learning, data reduction and evolutionary algorithms.

Cristóbal José Carmona received the M.Sc. and Ph.D. degrees in Computer Science from the University of Jaén, Spain, in 2006 and 2011, respectively. He is a researcher in the Department of Computer Science, University of Jaén, Spain. Currently, he is working with the Intelligent Systems and Data Mining Research Group of Jaén. His research interests include supervised descriptive rule discovery, subgroup discovery, contrast set mining, emerging pattern mining, evolutionary fuzzy systems, evolutionary algorithms and data mining.

Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has had more than 25 papers published in international journals. He has co-edited two special issues of international journals on different Data Mining topics. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.

Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain.

He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers published in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001).

He currently acts as Editor in Chief of the international journal “Progress in Artificial Intelligence” (Springer) and serves as area editor of the journal Soft Computing (area of evolutionary and bioinspired algorithms) and the International Journal of Computational Intelligence Systems (area of information systems). He acts as associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing, and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, Swarm and Evolutionary Computation.

He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, and International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010).

His current research interests include computing with words and decision making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
