Addressing imbalanced classification with instance generation techniques: IPADE-ID
Introduction
Classification with imbalanced datasets is a challenging data mining problem that has attracted considerable attention in recent years [1], [2]. The problem is important because it arises in many real-world data mining applications, including medical diagnosis, fraud detection, finance and network intrusion detection. In these applications, the samples of one class are greatly outnumbered by the samples of the other class. Usually, the minority class is the most interesting one from the learning point of view, and errors on it carry a higher cost [3], [4].
Imbalanced datasets pose a serious difficulty for most classifiers, which assume a nearly balanced class distribution [5]. Standard classifiers are designed to minimize a global measure of error that is independent of the class distribution; this biases them towards the majority class and makes them pay less attention to the minority class. Consequently, classifying the minority class is more error prone than classifying the majority class, as a large portion of the errors is concentrated in the minority class [6]. Furthermore, the examples of the minority class can be treated as noise and might be completely ignored by the classifier.
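The bias described above can be illustrated with a minimal sketch. The toy labels and the always-majority classifier below are hypothetical and do not come from the experimental study; they only show how a global measure such as accuracy can look good while the minority class is ignored.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples classified correctly (a global measure)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of `label` that are classified correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


# Hypothetical toy labels: 95 majority (0) and 5 minority (1) examples.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)       # 0.95: the global measure looks good
rec_maj = recall(y_true, y_pred, 0)  # 1.0: every majority example is correct
rec_min = recall(y_true, y_pred, 1)  # 0.0: every minority example is missed
```

All of the errors fall on the minority class, which is why per-class measures are preferred in this setting.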
Numerous approaches have been suggested to tackle the problem of classification with imbalanced datasets [1], [2], [7]. These approaches are developed at both the data and the algorithm levels. Solutions at the algorithm level modify existing learning algorithms so that their operation focuses on improving the learning of the minority class [8], [9]. Solutions at the data level, also known as data sampling, try to modify the original class distribution in order to obtain a more or less balanced dataset that can be used to correctly identify each class with standard classifiers [10], [11], [12].
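As an illustration of a data-level solution, the sketch below implements random undersampling, one of the simplest sampling schemes. It is not the method proposed in this paper; the function name `random_undersample` and the toy data are assumptions made for the example.

```python
import random


def random_undersample(X, y, seed=0):
    """Randomly drop examples until every class is as small as the smallest one."""
    rng = random.Random(seed)
    # Group the examples by class label.
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(examples) for examples in by_class.values())
    # Keep a random sample of n_min examples from each class.
    X_bal, y_bal = [], []
    for label, examples in by_class.items():
        for xi in rng.sample(examples, n_min):
            X_bal.append(xi)
            y_bal.append(label)
    return X_bal, y_bal


# Hypothetical toy dataset: 16 majority (0) and 4 minority (1) examples.
X = [[float(i)] for i in range(20)]
y = [0] * 16 + [1] * 4
X_bal, y_bal = random_undersample(X, y)
```

The balanced set keeps all minority examples and an equal number of randomly chosen majority examples, so it can be passed to a standard classifier unchanged.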
Instance reduction methods [13], which were originally designed for other preprocessing purposes (speeding up learning methods, improving their noise tolerance and reducing their storage requirements [14]), can also be applied to imbalanced datasets [15], [16] as a data level solution that seeks a balance between the minority and the majority classes. To obtain high performance, it is important that instance reduction methods adapt their bias to this situation.
An instance reduction process aims to find the best reduced set that represents the original training data with a smaller number of instances. Depending on how the reduced set is created, this methodology can be divided into Instance Selection (IS) [13], [17], [18] and Instance Generation (IG) [19], [20]. The former attempts to choose an appropriate subset of the original training data, while the latter can also build new artificial instances to better adjust the decision boundaries of the classes. In this manner, the IG process fills regions of the problem domain that have no representative examples in the original dataset. IS methods have been applied to imbalanced datasets with promising results [15], [16], [21]; however, to the best of our knowledge, IG techniques have not yet been used to deal with imbalanced classification problems.
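The difference between IS and IG can be sketched with a toy example. The data, the hand-picked subset and the use of one centroid per class as the generated instances are illustrative assumptions only; they are not the procedure used by IPADE-ID.

```python
import math


def nearest_label(query, prototypes):
    """1-NN rule: return the label of the prototype closest to `query`."""
    return min(prototypes, key=lambda p: math.dist(query, p[0]))[1]


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


# Hypothetical toy training set of (features, label) pairs.
train = [([0.0, 0.0], "maj"), ([0.0, 2.0], "maj"), ([2.0, 0.0], "maj"),
         ([2.0, 2.0], "maj"), ([5.0, 5.0], "min"), ([7.0, 7.0], "min")]

# Instance Selection: the reduced set is a subset of the original instances.
selected = [train[0], train[4]]

# Instance Generation: the reduced set may contain new artificial points,
# here one centroid per class that does not occur in the training data.
generated = []
for label in ("maj", "min"):
    members = [x for x, y in train if y == label]
    generated.append((centroid(members), label))

query = [3.4, 3.4]
label_is = nearest_label(query, selected)   # "min"
label_ig = nearest_label(query, generated)  # "maj"
```

Both reduced sets hold two instances, but the generated centroids (1, 1) and (6, 6) are new points, and they place the 1-NN decision boundary differently from the selected subset.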
Following the idea of IG techniques, we propose the Iterative Instance Adjustment for Imbalanced Domains (IPADE-ID) algorithm to deal with highly imbalanced datasets. IPADE-ID is a method, inspired by the IG technique IPADE [22], [23], that tries to obtain an adequate synthetic training set from the original training set, following an incremental approach to determine the most appropriate number of instances per class. The proposal is based on three fundamental operations: a customized initialization procedure, an evolutionary adjustment of the prototypes and the selection of the most representative examples to define the classes. The initialization procedure should suit the specific learning algorithm used with IPADE-ID.
In this work, we choose the Nearest Neighbor (NN) rule [24] and the C4.5 algorithm [25] as learning methods, and we provide initialization procedures for IPADE-ID that match these learning approaches. At each step, an optimization procedure based on an adaptive differential evolution algorithm [26], [27], [28] adjusts the positioning of the instances generated so far, and a selection procedure adds new instances if needed. This selection procedure has been specifically designed for the imbalanced scenario, focusing on the performance of the minority class. This informed and organized combination of techniques leads to a hybrid artificial intelligence system [29], [30] that is able to cope with imbalanced datasets.
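To give an idea of how differential evolution adjusts real-valued vectors such as prototype positions, the sketch below runs one generation of the classic DE/rand/1/bin scheme with fixed parameters. It is a simplification: the adaptive variant referenced above tunes its parameters during the search, and the fitness used in practice would be a classification measure on the training set. The names `de_generation` and `toy_fitness`, and the values F=0.5 and CR=0.9, are assumptions made for illustration.

```python
import random


def de_generation(population, fitness, F=0.5, CR=0.9, rng=None):
    """One generation of classic DE/rand/1/bin with greedy replacement.

    `fitness` is maximized; the population needs at least four individuals.
    """
    if rng is None:
        rng = random.Random(0)
    dim = len(population[0])
    new_population = []
    for i, target in enumerate(population):
        # Pick three distinct individuals, all different from the target.
        others = [j for j in range(len(population)) if j != i]
        a, b, c = (population[j] for j in rng.sample(others, 3))
        # Mutation: base vector plus the scaled difference of two others.
        mutant = [a[d] + F * (b[d] - c[d]) for d in range(dim)]
        # Binomial crossover; j_rand guarantees at least one mutant gene.
        j_rand = rng.randrange(dim)
        trial = [mutant[d] if (rng.random() < CR or d == j_rand) else target[d]
                 for d in range(dim)]
        # Greedy selection: keep the trial only if it is not worse.
        if fitness(trial) >= fitness(target):
            new_population.append(trial)
        else:
            new_population.append(target)
    return new_population


def toy_fitness(vector):
    """Hypothetical fitness: closeness to the point (1, 1, 1)."""
    return -sum((v - 1.0) ** 2 for v in vector)


rng = random.Random(42)
population = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(6)]
next_population = de_generation(population, toy_fitness, rng=rng)
```

Because replacement is greedy, no individual gets worse from one generation to the next, so the positions can only improve or stay the same with respect to the chosen fitness.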
In order to analyze the performance of the proposal, we focus on highly imbalanced binary classification problems, having selected a benchmark of 44 problems from the KEEL dataset repository [31]. Our experimental analysis measures the performance of the models using the Area Under the ROC Curve (AUC) [32]. The study is carried out using non-parametric statistical tests to check whether there are significant differences among the results [33], [34].
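For reference, the AUC can be computed from classifier scores as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; the function name `auc` and the scores are hypothetical.

```python
def auc(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct pair
    return wins / (len(pos) * len(neg))


# Hypothetical scores for 2 minority (1) and 4 majority (0) examples.
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.2]
labels = [1, 1, 0, 0, 0, 0]
value = auc(scores, labels)  # 7 of 8 pairs are ranked correctly: 0.875
```

Unlike accuracy, this measure depends only on how positives are ranked relative to negatives, so it is not dominated by the size of the majority class.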
The rest of the paper is organized as follows. Section 2 gives some background on classification with imbalanced datasets and instance generation techniques. Section 3 introduces the proposed approach. Sections 4 and 5 describe the experimental framework and the analysis of results, respectively. Finally, Section 6 presents the conclusions of this work.
Background
The purpose of this section is to provide the background information needed to describe our proposal. It is divided into two parts: a description of instance generation techniques (Section 2.1) and an introduction to the problem of classification with imbalanced datasets (Section 2.2).
Iterative instance adjustment for imbalanced domains: IPADE-ID
In this section, we present and describe in depth the proposed approach, denoted as IPADE for Imbalanced Domains (IPADE-ID). IPADE-ID is influenced by the IG algorithm IPADE and shares some features with it, such as its iterative way of working and the use of adaptive evolutionary techniques to optimize the instances generated so far. Nevertheless, IPADE-ID differs from its predecessor in several respects: IPADE-ID presents a new procedure for the initialization of the prototypes, specifically
Experimental framework
In this section, we present the setup of the experimental framework used to develop the analysis of our proposal. We describe the algorithms selected for the comparison together with their configuration parameters and the imbalanced datasets selected, and we motivate the need for statistical tests.
Experimental results and analysis
In this section, we present the empirical analysis of the proposed IPADE-ID algorithm in order to determine its robustness in a scenario of highly imbalanced datasets. We divide the study into several parts: a first part devoted to the results of IPADE-ID using the NN rule (Section 5.1), and a second part with the results of the proposal using the C4.5 decision tree as classifier (Section 5.2). Finally, a study on the impact of the data modification that some of the
Concluding remarks
In this paper, we have presented IPADE-ID, a new approach to deal with the problem of classification with highly imbalanced datasets. The proposal provides a solution that modifies the training set using an IG technique with differential evolution as the base of the procedure, adapting its way of working to the imbalanced scenario. As learning methods, we have selected the NN rule and the C4.5 decision tree, and we have adapted the IPADE-ID approach to the behavior of these methods.
The
Acknowledgments
This work was partially supported by the Spanish Ministry of Science and Technology under project TIN2011-28488 and the Andalusian Research Plans P11-TIC-7765 and P10-TIC-6858. V. López holds an FPU scholarship from the Spanish Ministry of Education.
References (71)
- et al. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications (2012)
- et al. VQSVM: a case study for incorporating prior domain knowledge into inductive machine learning. Neurocomputing (2010)
- Error back-propagation algorithm for classification of imbalanced data. Neurocomputing (2011)
- et al. Evolutionary-based selection of generalized instances for imbalanced classification. Knowledge-Based Systems (2012)
- et al. Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Applied Soft Computing (2009)
- et al. IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule. Pattern Recognition (2010)
- et al. Self-generating prototypes for pattern classification. Pattern Recognition (2007)
- et al. Hybrid intelligent algorithms and applications. Information Sciences (2010)
- et al. New trends and applications on hybrid artificial intelligence systems. Neurocomputing (2012)
- et al. Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters (2003)
- High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition
- Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification. Pattern Recognition
- An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications
- Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications
- Iterative boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs. Pattern Recognition
- Class imbalance methods for translation initiation site recognition in DNA sequences. Knowledge-Based Systems
- A unifying view on dataset shift in classification. Pattern Recognition
- The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition
- Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces. Pattern Recognition
- Strategies for learning in class imbalance problems. Pattern Recognition
- Center-based nearest neighbor classifier. Pattern Recognition
- Improving nearest neighbor rule with a simple adaptative distance measure. Pattern Recognition Letters
- Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence
- Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering
- Mining with rarity: a unifying framework. SIGKDD Explorations
- The class imbalance problem: a systematic study. Intelligent Data Analysis Journal
- SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research
- A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explorations
- Reduction techniques for instance-based learning algorithms. Machine Learning
- Machine Learning and Data Mining: Introduction to Principles and Algorithms
- Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence
- A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews
Victoria López received her M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. She is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. Her research interests include data mining, classification in imbalanced domains, fuzzy rule learning and evolutionary algorithms.
Isaac Triguero received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, semisupervised learning, data reduction and evolutionary algorithms.
Cristóbal José Carmona received the M.Sc. and Ph.D. degrees in Computer Science from the University of Jaén, Spain, in 2006 and 2011, respectively. He is a researcher in the Department of Computer Science, University of Jaén, Spain. Currently, he is working with the Intelligent Systems and Data Mining Research Group of Jaén. His research interests include supervised descriptive rule discovery, subgroup discovery, contrast set mining, emerging pattern mining, evolutionary fuzzy systems, evolutionary algorithms and data mining.
Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has had more than 25 papers published in international journals. He has co-edited two special issues of international journals on different Data Mining topics. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.
Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers published in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001).
He currently acts as Editor in Chief of the international journal “Progress in Artificial Intelligence” (Springer) and serves as area editor of the journal Soft Computing (area of evolutionary and bioinspired algorithms) and of the International Journal of Computational Intelligence Systems (area of information systems). He acts as associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing, and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, and International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010).
His current research interests include computing with words and decision making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.