RGAN-EL: A GAN and ensemble learning-based hybrid approach for imbalanced data classification
Introduction
Machine learning plays an increasingly important role in many decision-making fields, and supervised classification is among its most widely used techniques. A supervised classification problem comprises two stages: model training and prediction. In the training stage, a classifier is trained on known training data by an effective classification algorithm; in the prediction stage, the trained classifier predicts labels for unseen data (Chen et al., 2021). Many classification algorithms have been proposed, such as the support vector machine, the multi-layer perceptron and the decision tree, and they are widely applied in intrusion detection, pattern recognition and fraud detection. However, these algorithms are designed under the assumption of class balance: they are trained on balanced samples, assign every class the same misclassification cost, and optimize the overall classification accuracy across all classes. In practice, the data are usually imbalanced. For example, in software defect detection most samples are normal and only a few are defective (Chakraborty & Chakraborty, 2020); in network intrusion detection normal traffic far outweighs attack traffic (Ding et al., 2022); and in fraud detection abnormal records are far rarer than normal ones (Li et al., 2021a). On imbalanced data, the minority class contributes little to the overall error, so the learned model tends to favor the majority class and ignore the minority class. When minority samples are extremely scarce, the classifier may even label all data as the majority class; because the majority class dominates, the resulting accuracy can still be high, but it is clearly meaningless.
Yet in applications such as software defect detection, intrusion detection and fraud detection, the minority class samples carry the higher value and should be the focus of data mining (Wang et al., 2021b).
Imbalanced data classification is more difficult and complex than balanced data classification (Huang et al., 2022, Wang et al., 2021a). In an imbalanced classification problem, we must attend not only to the overall classification performance but especially to the performance on the minority class. Improving the recognition rate of minority class samples while maintaining the overall recognition rate is therefore a very challenging problem. In this study, we call the class with many samples the negative class, the class with few samples the positive class, and the ratio of the negative class size to the positive class size the imbalance rate, IR = |S_n| / |S_p| (Eq. (1)), where S_n and S_p denote the negative and positive class sample sets respectively (Mirzaei et al., 2021). To solve the imbalanced classification problem, many scholars have carried out extensive research and proposed relevant algorithms, which can be grouped into three types: data-level, algorithm-level and ensemble learning methods (Jedrzejowicz & Jedrzejowicz, 2021).
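As a concrete illustration of Eq. (1), the imbalance rate can be computed directly from a label vector. This is a minimal sketch with our own variable names, not code from the paper:

```python
from collections import Counter

def imbalance_rate(labels):
    """Imbalance rate IR = |S_n| / |S_p|, treating the most frequent
    class as the negative (majority) class."""
    counts = Counter(labels)
    (_, n_neg), (_, n_pos) = counts.most_common(2)
    return n_neg / n_pos

# Example: 90 negative (majority) samples vs. 10 positive ones.
labels = [0] * 90 + [1] * 10
print(imbalance_rate(labels))  # 9.0
```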
The data-level strategy, usually called data preprocessing, brings the data set to a balanced state through resampling (oversampling or undersampling) (Tsai et al., 2019); the balanced samples it produces can be used with any classifier. In contrast, the algorithm-level strategy designs new algorithms or improves existing ones (such as cost-sensitive algorithms) to handle imbalanced data directly (Wen et al., 2021). The traditional ensemble learning strategy combines different base classifiers to decide the final classification jointly, mitigating the vulnerability of a single classifier to imbalanced data (Kim et al., 2021). Because ensemble learning generally classifies better than a single classifier, it is more widely used in practical imbalanced data classification.
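The two data-level options can be sketched with plain random resampling; in practice, interpolation-based oversampling (e.g. SMOTE) or cleaning-based undersampling would replace the random draws. This is a generic illustration, not the paper's sampling scheme:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate minority samples at random until the classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Discard majority samples at random until the classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

maj = [[float(x), 0.0] for x in range(100)]   # 100 majority samples
mino = [[float(x), 1.0] for x in range(10)]   # 10 minority samples
over_maj, over_min = random_oversample(maj, mino)
under_maj, under_min = random_undersample(maj, mino)
print(len(over_min), len(under_maj))  # 100 10
```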
Because data-level and ensemble learning strategies are not limited by any particular algorithm, combining them is an especially promising direction. Data resampling can make up for the shortage of training samples available to each base classifier in an ensemble, alleviating overfitting during training. In recent years, generative adversarial networks (GANs) have been widely used for images, speech and text (Kaliyev et al., 2021, Yu et al., 2017), and, owing to their strong data-fitting ability, their application to data expansion and augmentation has also been widely studied (Li et al., 2021b, Zhou et al., 2019). Most existing work uses a GAN directly to generate data until the data set is balanced and then trains a classifier on it. However, class imbalance is not the only factor that makes learning difficult. The placement of the decision boundary is closely related to the density of positive and negative samples in the class overlapping area (Yuan et al., 2021). When minority samples in the overlapping area are extremely sparse, the undertrained classifier tends to push the decision boundary toward the minority class in order to optimize the overall result. Training on imbalanced data therefore becomes sharply harder when a class overlapping area exists (Vuttipittayamongkol et al., 2021). Motivated by this, we combine an improved GAN with ensemble learning into a hybrid method for imbalanced data classification. Our main contributions are as follows:
- Firstly, we propose a training-data selection strategy based on the roulette wheel selection method. The positive and negative samples in inter-class overlapping areas play an important role in placing the final classification boundary, so roulette wheel selection is used to improve the GAN so that it fits the sample distribution of these areas more effectively.
- Secondly, we propose an improved GAN model (RGAN) that effectively improves the quality of the generated data. In RGAN, we design a similarity loss and a difference loss for the generator. The similarity loss measures the feature distance between generated and original samples, so that generated positive samples lie closer to the original distribution. The difference loss introduces information about the negative sample distribution, pushing the generated positive samples away from it and thus avoiding additional classification difficulty in the overlapping areas.
- Thirdly, based on the trained RGAN, we propose a noise-sample filtering strategy that effectively prevents the generation of noise samples.
- Fourthly, based on the improved RGAN model, we propose an imbalanced data classification strategy combining data oversampling and ensemble learning.
- Fifthly, we carry out experimental verification on 41 real-world imbalanced data sets. The results show that RGAN-EL classifies better than typical GAN-based methods and ensemble learning methods.
The rest of the paper is organized as follows. Section 2 reviews existing imbalanced data classification methods. Section 3 describes the proposed method in detail. Section 4 presents the experimental verification and discusses and analyzes the results. Finally, Section 5 concludes the study and outlines the next research plan.
Related works
In this section, we review and summarize the three families of methods commonly used to solve imbalanced data classification: data-level, algorithm-level and ensemble learning methods.
Methodology
This section introduces the detailed algorithm of RGAN-EL. To show the algorithm flow more intuitively, the structure diagram in Fig. 1 is given. As shown in the figure, during the experiment we first divide the data set into a training set and a test set. The training set is used to construct a balanced training sample set so as to obtain a better training effect. The test set maintains the original sample distribution for the detection of the
Experiment
In this section, we design three experiments to evaluate the effectiveness of the proposed method. The first experiment explores whether RGAN-EL can provide more competitive results than typical and advanced ensemble learning methods. The second explores whether the performance of the improved GAN model is significantly better than that of the original GAN model. The last experiment further analyzes the RGAN-EL model. On the one hand, the RGAN method is
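Because overall accuracy is misleading on imbalanced data (as the introduction notes), comparisons of this kind typically rely on class-aware metrics. A minimal sketch of F1 and G-mean computed from confusion counts, written as our own helper rather than the paper's evaluation code:

```python
import math

def f1_and_gmean(y_true, y_pred, positive=1):
    """F1 score for the positive class and the geometric mean
    of the per-class recalls (sensitivity and specificity)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, math.sqrt(recall * specificity)

# A classifier that labels everything negative is 90% accurate on a
# 9:1 data set, yet scores 0 on both class-aware metrics.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(f1_and_gmean(y_true, y_pred))  # (0.0, 0.0)
```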
Conclusions and future work
In the field of machine learning and data mining, traditional classification algorithms are designed based on balanced data. However, in many practical applications, the data to be processed are imbalanced. When the traditional classification algorithm is used to solve the problem of imbalanced data classification, the generalization performance of the classification algorithm will be significantly reduced. Although data sampling, cost-sensitive learning and ensemble learning have been proposed
CRediT authorship contribution statement
Hongwei Ding: Conceptualization, Methodology, Validation, Investigation, Data curation, Writing – original draft, Writing – review & editing. Yu Sun: Conceptualization, Methodology, Validation, Investigation, Data curation, Writing – original draft, Writing – review & editing. Zhenyu Wang: Validation, Investigation, Writing – review & editing. Nana Huang: Validation, Investigation, Writing – review & editing. Zhidong Shen: Investigation, Funding acquisition. Xiaohui Cui: Conceptualization,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant No. 2018YFC1604000, the Key R&D Projects in Hubei Province, China under Grants No. 2022BAA041 and No. 2021BCA124, and in part by the Wuhan University Specific Fund for Major School-level International Initiatives, China.
References (56)
- et al. A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Information Sciences (2021).
- et al. Imbalanced data classification: A KNN and generative adversarial networks-based hybrid approach for intrusion detection. Future Generation Computer Systems (2022).
- et al. Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Systems with Applications (2021).
- et al. Data augmentation of credit default swap transactions based on a sequence GAN. Information Processing & Management (2022).
- Stochastic gradient boosting. Computational Statistics & Data Analysis (2002).
- et al. Cost sensitive ν-support vector machine with LINEX loss. Information Processing & Management (2022).
- et al. Relevant information undersampling to support imbalanced data classification. Neurocomputing (2021).
- et al. GEP-based classifier for mining imbalanced data. Expert Systems with Applications (2021).
- et al. Ensemble learning-based filter-centric hybrid feature selection framework for high-dimensional imbalanced data. Knowledge-Based Systems (2021).
- et al. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data. Neural Networks (2020).
- A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications.
- JDGAN: Enhancing generator on extremely limited data via joint distribution. Neurocomputing.
- FW-SMOTE: A feature-weighted oversampling approach for imbalanced classification. Pattern Recognition.
- An imbalanced learning method by combining SMOTE with center offset factor. Applied Soft Computing.
- CDBH: A clustering and density-based hybrid approach for imbalanced data classification. Expert Systems with Applications.
- SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning. Information Sciences.
- Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences.
- Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Information Sciences.
- On the class overlap problem in imbalanced data classification. Knowledge-Based Systems.
- Adaptive ensemble of classifiers with regularization for imbalanced data classification. Information Fusion.
- Local distribution-based adaptive minority oversampling for imbalanced data classification. Neurocomputing.
- NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Systems with Applications.
- One-step spectral rotation clustering for imbalanced high-dimensional data. Information Processing & Management.
- A novel progressively undersampling method based on the density peaks sequence for imbalanced data. Knowledge-Based Systems.
- Cost-sensitive KNN classification. Neurocomputing.
- KRNN: k rare-class nearest neighbour classification. Pattern Recognition.
- EHSO: Evolutionary hybrid sampling in overlapping scenarios for imbalanced learning. Neurocomputing.
- To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering.
1. Hongwei Ding and Yu Sun contributed equally to this work.