ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification
Introduction
In many real-world applications, sample distributions (i.e., of examples, instances, observations, or cases) are highly skewed because representatives of some classes rarely appear. The minority class is usually the more interesting one from a learning point of view, because misclassifying it typically incurs a high cost. In supervised machine learning, significant differences in prior class probabilities render the classification of minority or rare classes difficult. This situation is known as the class imbalance problem [1], [2], [3]. The class imbalance problem exists in many real-world domains, such as medical applications [4], risk management [5], detection of fraudulent telephone calls [6], and biological data analysis [7], [8]. For these tasks, the key point is to obtain a classifier with high accuracy on the minority class without severely jeopardizing the accuracy on the majority class [2].
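The cost asymmetry is easy to see with a toy calculation (the numbers are invented for illustration): on a 95:5 dataset, a degenerate classifier that always predicts the majority class scores 95% accuracy while recalling none of the minority class.

```python
# Toy illustration: plain accuracy hides total failure on the minority class.
y_true = [0] * 95 + [1] * 5   # 95 majority samples (0), 5 minority samples (1)
y_pred = [0] * 100            # a "classifier" that always predicts the majority

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(
    1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1
) / sum(1 for t in y_true if t == 1)

print(accuracy)         # 0.95 -- looks strong
print(minority_recall)  # 0.0  -- every minority sample is missed
```

This is exactly why imbalanced domains rely on class-aware metrics rather than raw accuracy.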
Many techniques have been proposed over more than a decade to alleviate class imbalance [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. These techniques fall into three categories [10]: (1) data-level approaches [11], [12], [13], [14], which balance the distribution between majority-class and minority-class examples by sampling; (2) algorithm-level approaches, which adapt existing learners to alleviate their bias toward the majority class, such as cost-sensitive approaches [15], [16], [17], [18]; and (3) hybrid methods [19], [20], which combine the advantages of the two abovementioned groups.
Sampling methods are among the most significant approaches to the imbalance problem, as they manage imbalanced learning in a straightforward manner [21], [22], [23]. We herein introduce a novel hybrid proposal named ant colony optimization resampling (ACOR) for class imbalance classification. ACOR comprises two main steps: first, it rebalances an imbalanced dataset with a specific oversampling algorithm; next, it finds a (sub)optimal subset of the balanced dataset by ant colony optimization. The proposed ACOR algorithm is a general preprocessing framework that can enhance the performance of existing oversampling algorithms on imbalanced datasets. Unlike other oversampling techniques, ACOR does not focus on the mechanics of generating new samples; its main advantage is that existing oversampling algorithms can be fully utilized while ant colony optimization selects an ideal training set. Compared with existing oversampling algorithms, ACOR effectively reduces the number of samples and renders the training set more suitable for a specified classifier.
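The two-step ACOR outline above can be sketched as a generic pipeline. This is a minimal illustration, not the authors' implementation: the oversampler and the subset selector are pluggable, and the placeholders used in the demo (random oversampling and a keep-everything mask) stand in for a real oversampler and the ACO search.

```python
import random

def acor(X, y, oversample, select_subset):
    """ACOR's two-step outline: (1) rebalance with an existing oversampler,
    (2) search the balanced set for a (sub)optimal training subset."""
    X_bal, y_bal = oversample(X, y)        # step 1: e.g. SMOTE, BSO, ROS, ADASYN
    keep = select_subset(X_bal, y_bal)     # step 2: boolean mask from ACO search
    X_sel = [x for x, k in zip(X_bal, keep) if k]
    y_sel = [c for c, k in zip(y_bal, keep) if k]
    return X_sel, y_sel

def random_oversample(X, y, minority=1, seed=0):
    """Placeholder for step 1: duplicate minority samples until classes balance."""
    rng = random.Random(seed)
    idx_min = [i for i, c in enumerate(y) if c == minority]
    idx_maj = [i for i, c in enumerate(y) if c != minority]
    extra = [rng.choice(idx_min) for _ in range(len(idx_maj) - len(idx_min))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Demo on an 8:2 imbalanced toy set; the subset selector here keeps everything.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_sel, y_sel = acor(X, y, random_oversample, lambda X, y: [True] * len(y))
print(y_sel.count(0), y_sel.count(1))   # 8 8 -- balanced after step 1
```

Swapping in a real oversampler and an ACO-driven mask for `select_subset` yields the full framework.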
To validate the effectiveness of ACOR, we implemented ACOR-SMOTE, ACOR-BSO, ACOR-ROS, and ACOR-ADASYN on top of four popular oversampling algorithms: the synthetic minority oversampling technique (SMOTE) [24], borderline-SMOTE (BSO) [25], random oversampling (ROS), and adaptive synthetic sampling (ADASYN) [26]. Extensive experiments compared the four pairs of algorithms: ACOR-SMOTE vs. SMOTE, ACOR-BSO vs. BSO, ACOR-ROS vs. ROS, and ACOR-ADASYN vs. ADASYN. The experimental results demonstrate that ACOR statistically enhances the performance of the compared sampling algorithms.
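For reference, the core idea shared by SMOTE and its variants is interpolation between minority-class neighbors. The following is a minimal sketch of that idea only, assuming Euclidean distance and tuple-valued samples; it is not the reference implementation and omits the refinements that distinguish BSO and ADASYN (border focus and density-adaptive generation).

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """SMOTE's core idea: each synthetic point lies on the line segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        out.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return out

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic = smote_sketch(minority, n_new=5)
```

Every generated point stays inside the convex hull of the minority samples, which is why SMOTE avoids the exact-duplication overfitting of plain ROS.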
The remainder of the paper is organized as follows. In Section 2, we briefly present the most related works. In Section 3, we describe the motivations of the proposed ACOR algorithm. Section 4 presents the concept and procedure of the ACOR algorithm in detail. In Section 5, the experimental setup, results, and analysis are presented. Finally, Section 6 presents the conclusions and outlines future research.
Section snippets
Related works
In this section, we briefly review related work. Section 2.1 presents the most popular sampling techniques; Section 2.2 presents common assessment metrics for imbalanced domains.
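Among the assessment metrics commonly used in imbalanced domains is the geometric mean (G-mean) of sensitivity and specificity, which penalizes a classifier that sacrifices either class. The helper below illustrates the standard textbook definition (it is an illustration, not code from the paper):

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity
    (majority recall), a standard imbalance-aware alternative to accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

A classifier that ignores the minority class scores a G-mean of 0 no matter how high its accuracy is, which is the property that makes the metric popular in this literature.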
Motivation
The basic theory of supervised learning assumes that the involved data obey some probability distribution D over X × Y, where X represents the feature vectors and Y the class labels. For any x ∈ X, a class label y ∈ Y is associated with it. In classification learning, a sample set is used to train the classifier, whose purpose is to obtain a classification model that is later used to classify unknown objects. In practice, to validate the classification model, a test sample set is often used to…
Ant colony optimization resampling (ACO-resampling)
Ant colony optimization (ACO), developed by Colorni et al. [52], is a simulated evolution algorithm inspired by the foraging behavior of real ant colonies. It has recently been applied to a wide range of combinatorial problems, including the traveling salesman problem [53], routing in telecommunication networks [54], and feature selection [55].
We herein present a novel resampling algorithm named ACOR. Briefly, ACOR primarily includes two steps: first, it…
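A binary-ACO subset search of the kind this section describes can be sketched as follows. This is a simplified illustration, not the paper's algorithm: pheromone values directly serve as per-sample inclusion probabilities, reinforcement uses only the best subset found so far, and the demo fitness (rewarding class balance) is invented for the example.

```python
import random

def aco_subset(n, fitness, n_ants=10, n_iter=30, rho=0.2, seed=0):
    """Sketch of a binary-ACO subset search: pheromone tau[i] is the
    probability of keeping sample i; trails evaporate at rate rho, then
    the best subset found so far deposits pheromone on the samples it kept."""
    rng = random.Random(seed)
    tau = [0.5] * n
    best_mask = [True] * n
    best_fit = fitness(best_mask)           # start from the full set
    for _ in range(n_iter):
        for _ in range(n_ants):
            mask = [rng.random() < tau[i] for i in range(n)]
            if any(mask):
                f = fitness(mask)
                if f > best_fit:
                    best_fit, best_mask = f, mask
        for i in range(n):                  # evaporation + reinforcement
            tau[i] = (1 - rho) * tau[i] + rho * (1.0 if best_mask[i] else 0.0)
    return best_mask

# Demo fitness: prefer subsets whose two classes are balanced (labels invented).
y = [0] * 8 + [1] * 2
def balance(mask):
    n0 = sum(1 for m, c in zip(mask, y) if m and c == 0)
    n1 = sum(1 for m, c in zip(mask, y) if m and c == 1)
    return -abs(n0 - n1)

mask = aco_subset(len(y), balance)
```

In ACOR itself the fitness would instead evaluate a classifier trained on the candidate subset, so the search converges toward a training set tailored to that classifier.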
Experiments
A comprehensive performance study was conducted to evaluate our proposed ACOR algorithm. We first present the experimental framework, including the benchmark datasets, classification algorithms, compared methods, and assessment metrics. The results and discussions are presented subsequently.
Conclusions
Numerous sampling-based preprocessing methods have been proposed to solve the problem of class imbalanced classification. The fundamental principle of these methods is to rebalance an imbalanced dataset by a concrete strategy. The main contribution of this study is the proposed ACOR algorithm, which is a general preprocessing framework for enhancing the performance of existing oversampling algorithms for the imbalanced problem. The main advantage of ACOR is that existing oversampling algorithms…
CRediT authorship contribution statement
Min Li: Conceptualization, Methodology, Investigation, Software, Writing - original draft. An Xiong: Data curation, Software, Visualization. Lei Wang: Funding acquisition, Validation, Formal analysis, Visualization. Shaobo Deng: Resources, Writing - review & editing, Supervision. Jun Ye: Resources, Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was partially supported by grants from the National Natural Science Foundation of China (Nos. 61562061, 61363047), by funds from the Jiangxi Education Department (No. GJJ151126), and by funds from the Science and Technology Support Foundation of Jiangxi Province, PR China (Nos. 20161BBE50050, 20161BBE50051).
References (64)
- et al., ROSEFW-RF: The winner algorithm for the ECBDL'14 Big Data Competition: An extremely imbalanced big data bioinformatics problem, Knowl.-Based Syst. (2015)
- et al., Online feature selection for high-dimensional class-imbalanced data, Knowl.-Based Syst. (2017)
- et al., Multiple extreme learning machines for a two-class imbalance corporate life cycle prediction, Knowl.-Based Syst. (2013)
- et al., PBC4cip: A new contrast pattern-based classifier for class imbalance problems, Knowl.-Based Syst. (2017)
- et al., A survey of multiple classifier systems as hybrid systems, Inf. Fusion (2014)
- et al., ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing (2013)
- et al., An experimental comparison of performance measures for classification, Pattern Recognit. Lett. (2009)
- An introduction to ROC analysis, Pattern Recognit. Lett. (2006)
- et al., An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system, Appl. Math. Comput. (2008)
- et al., Class dependent feature scaling method using Naive Bayes classifier for text datamining, Pattern Recognit. Lett. (2009)
- Editorial: special issue on learning from imbalanced data sets, SIGKDD Explor.
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Classification of imbalanced data: a review, Int. J. Pattern Recognit. Artif. Intell.
- Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem, Nonlinear Anal. Real World Appl.
- Combining data mining and machine learning for effective user profiling
- Gene selection for cancer classification using support vector machines, Mach. Learn.
- Learning from imbalanced data: open challenges and future directions, Prog. Artif. Intell.
- Automatically countering imbalance and its empirical relationship to cost, Data Min. Knowl. Discov.
- Evolutionary undersampling for classification with imbalanced data sets: proposals and taxonomy, Evol. Comput.
- MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng.
- Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods, Knowl.-Based Syst.
- Entropy-based fuzzy support vector machine for imbalanced datasets, Knowl.-Based Syst.
- On multi-class cost-sensitive learning, Comput. Intell.
- RAMOBoost: Ranked minority oversampling in boosting, IEEE Trans. Neural Netw.
- A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor.
- Editorial: special issue on learning from imbalanced datasets, ACM SIGKDD Explor. Newsl.
- Imbalanced Learning: Foundations, Algorithms, and Applications
- Boosted classification trees and class probability/quantile estimation, J. Mach. Learn. Res.
- SMOTE: Synthetic minority oversampling technique, J. Artif. Intell. Res.