RGAN-EL: A GAN and ensemble learning-based hybrid approach for imbalanced data classification

https://doi.org/10.1016/j.ipm.2022.103235

Abstract

Imbalanced sample distribution is a major cause of performance degradation in machine learning algorithms. To address this, this study proposes a hybrid framework (RGAN-EL) that combines a generative adversarial network (GAN) with ensemble learning to improve classification performance on imbalanced data. First, we propose a training sample selection strategy based on roulette wheel selection, so that the GAN pays more attention to the class-overlapping area when fitting the sample distribution. Second, we design two generator training losses and propose a noise-sample filtering method to improve the quality of the generated samples. The minority class is then oversampled with the improved model (RGAN) to obtain a balanced training sample set. Finally, training and prediction are carried out with an ensemble learning strategy. We conducted experiments on 41 real imbalanced data sets using two evaluation metrics, F1-score and AUC: RGAN-EL is compared with six typical ensemble learning methods, and RGAN with three typical GAN models. The experimental results show that RGAN-EL significantly outperforms the six ensemble learning methods, and RGAN is clearly improved over the three classical GAN models.

Introduction

Machine learning plays an increasingly important role in many decision-making fields, and supervised classification is among its most widely used tasks. A supervised classification problem consists of two stages: model training and prediction. In training, a classifier is learned from known training data by an effective classification algorithm; in prediction, the trained classifier is used to label unknown data (Chen et al., 2021). Many classification algorithms have been proposed, such as the support vector machine, the multi-layer perceptron and decision trees, and they are widely applied to intrusion detection, pattern recognition and fraud detection. However, these algorithms are designed under the assumption of class balance: they are trained on balanced samples, assign every class the same misclassification cost, and optimize the overall classification accuracy across all classes. In practice, the available data are usually imbalanced. For example, in software defect detection most samples are normal and only a few are defective (Chakraborty & Chakraborty, 2020); in network intrusion detection, normal traffic far exceeds attack traffic (Ding et al., 2022); and in fraud detection, abnormal records are far rarer than normal ones (Li et al., 2021a). On imbalanced data, the minority class contributes little to the overall error, so the trained model tends to favor the majority class and ignore the minority class. When minority samples are very few, the classifier may even assign all data to the majority class; because the majority class dominates, the resulting accuracy can be high yet meaningless. Precisely in applications such as defect detection, intrusion detection and fraud detection, the minority class samples carry the higher value and should be the focus of data mining (Wang et al., 2021b).

The problem of imbalanced data classification is more difficult and complex than that of balanced data classification (Huang et al., 2022, Wang et al., 2021a). Here we must attend not only to the overall classification performance but especially to the performance on the minority classes; improving minority-class recognition while maintaining the overall recognition rate is a challenging problem. In this study, we call the class with many samples the negative class, the class with few samples the positive class, and the ratio of negative to positive samples the imbalance rate, as shown in Eq. (1), where n_neg and n_pos denote the numbers of negative- and positive-class samples, respectively (Mirzaei et al., 2021). Many methods have been proposed to address imbalanced classification; they can be grouped into three types: data-level, algorithm-level and ensemble learning (Jedrzejowicz & Jedrzejowicz, 2021).

IR = n_neg / n_pos    (1)
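Eq. (1) can be computed directly from the class counts of a binary label vector; a minimal sketch (the helper name `imbalance_ratio` is ours, not the paper's):

```python
from collections import Counter

def imbalance_ratio(y):
    """IR = n_neg / n_pos, taking the larger class as negative (majority)."""
    counts = Counter(y)
    n_neg = max(counts.values())  # majority (negative) class size
    n_pos = min(counts.values())  # minority (positive) class size
    return n_neg / n_pos

y = [0] * 90 + [1] * 10
print(imbalance_ratio(y))  # 9.0
```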

Data-level strategies, usually called data preprocessing methods, bring the data set to a balanced state by resampling (oversampling or undersampling) (Tsai et al., 2019); the resulting balanced samples can be fed to any classifier. In contrast, algorithm-level strategies design new algorithms or modify existing ones (e.g., cost-sensitive algorithms) to handle imbalanced classification (Wen et al., 2021). Traditional ensemble learning combines different base classifiers so that they jointly decide the final classification, mitigating the vulnerability of a single classifier to imbalanced data (Kim et al., 2021). Because ensembles generally classify better than a single classifier, they are more widely used in practical imbalanced data classification.
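The simplest data-level baseline is random oversampling, which duplicates minority samples until the classes are balanced; SMOTE-style methods interpolate instead of duplicating, and GAN-based methods such as the one studied here generate new samples from a learned distribution. A minimal sketch (the function name and signature are illustrative, not from the paper):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority samples at random until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))  # duplicate a random minority sample
    data = majority + minority
    rng.shuffle(data)
    Xb, yb = zip(*data)
    return list(Xb), list(yb)

Xb, yb = random_oversample([[1], [2], [3], [4], [5]], [0, 0, 0, 0, 1], minority_label=1)
print(yb.count(0), yb.count(1))  # 4 4
```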

Because data-level and ensemble learning strategies are not tied to a particular algorithm, combining them is especially promising: data resampling can compensate for the shortage of training samples available to each base classifier in the ensemble and thereby reduce overfitting during training. In recent years, generative adversarial networks (GANs) have been widely applied to images, speech and text (Kaliyev et al., 2021, Yu et al., 2017) and, owing to their strong ability to fit data distributions, to data expansion and augmentation as well (Li et al., 2021b, Zhou et al., 2019). Most existing work directly uses a GAN to generate data until the data set is balanced and then trains a classifier. However, class imbalance is not the only source of learning difficulty: the placement of the decision boundary is closely related to the densities of the positive and negative classes in the class-overlapping area (Yuan et al., 2021). When minority samples in the overlapping area are extremely sparse, the under-trained classifier tends to push the decision boundary toward the minority class in pursuit of the overall optimum, so training on imbalanced data becomes much harder when a class-overlapping area exists (Vuttipittayamongkol et al., 2021). Motivated by this, we combine an improved GAN with ensemble learning into a hybrid method for imbalanced data classification. Our main contributions are as follows:

  • Firstly, we propose a training data selection strategy based on the roulette wheel selection method. The positive and negative samples in the inter-class overlapping area play an important role in placing the final classification boundary, so roulette wheel selection is used to improve the GAN, letting it fit the sample distribution of the overlapping area more effectively.

  • Secondly, we propose an improved GAN model (RGAN) that effectively improves the quality of the generated data. In RGAN, we design a similarity loss and a difference loss for the generator. The similarity loss measures the feature distance between generated and original samples, so that generated positive samples stay close to the original distribution. The difference loss introduces information about the negative-class distribution, pushing the generated positive samples away from the negative class and thus avoiding additional classification difficulty in the overlapping area.

  • Thirdly, based on the trained RGAN, we propose a noise-sample filtering strategy that effectively prevents the generation of noise samples.

  • Fourthly, based on the improved RGAN model, we propose an imbalanced data classification strategy combining data oversampling and ensemble learning.

  • Fifthly, we conducted experiments on 41 real-world imbalanced data sets. The results show that RGAN-EL classifies better than typical GAN methods and ensemble learning methods.
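The first contribution rests on fitness-proportional (roulette wheel) selection: each candidate training sample is drawn with probability proportional to a weight, so that minority samples near the class-overlapping area are chosen more often for GAN training. The sketch below illustrates the selection principle only; the weighting scheme (here, arbitrary toy weights) and the function name are ours, not the paper's exact formulation:

```python
import numpy as np

def roulette_select(samples, weights, k, seed=None):
    """Draw k samples with probability proportional to their weights.

    Larger weights (e.g. minority points closer to the class-overlap
    region) make a sample more likely to be selected for GAN training.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # normalise the weights into a probability "wheel"
    idx = rng.choice(len(samples), size=k, replace=True, p=p)
    return samples[idx]

# toy usage: 5 minority samples, higher weight = nearer the overlap area
X = np.arange(10).reshape(5, 2)
w = [0.1, 0.1, 0.2, 0.3, 0.3]
subset = roulette_select(X, w, k=3, seed=0)
```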

The rest of the paper is organized as follows. Section 2 reviews existing methods for imbalanced data classification. Section 3 describes the proposed methods in detail. Section 4 presents the experimental verification and discusses and analyzes the results. Finally, Section 5 concludes the study and outlines future work.


Related works

In this section, we mainly review and summarize the three methods commonly used to solve imbalanced data classification, namely data-level, algorithm-level and ensemble learning.

Methodology

This section introduces the RGAN-EL algorithm in detail. To show the algorithm flow more intuitively, the structure diagram in Fig. 1 is given. As shown in the figure, we first divide the data set into a training set and a test set during the experiment. The training set is used to construct a balanced training sample set to obtain a better training effect. The test set maintains the original sample distribution for the detection of the

Experiment

In this section, we design three experiments to evaluate the effectiveness of the proposed method. The first explores whether RGAN-EL provides more competitive results than typical and advanced ensemble learning methods. The second explores whether the improved GAN model significantly outperforms the original GAN model. The last experiment further analyzes the RGAN-EL model. On the one hand, RGAN method is

Conclusions and future work

In the field of machine learning and data mining, traditional classification algorithms are designed for balanced data. In many practical applications, however, the data to be processed are imbalanced, and applying a traditional classification algorithm to them significantly reduces its generalization performance. Although data sampling, cost-sensitive learning and ensemble learning have been proposed

CRediT authorship contribution statement

Hongwei Ding: Conceptualization, Methodology, Validation, Investigation, Data curation, Writing – original draft, Writing – review & editing. Yu Sun: Conceptualization, Methodology, Validation, Investigation, Data curation, Writing – original draft, Writing – review & editing. Zhenyu Wang: Validation, Investigation, Writing – review & editing. Nana Huang: Validation, Investigation, Writing – review & editing. Zhidong Shen: Investigation, Funding acquisition. Xiaohui Cui: Conceptualization,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2018YFC1604000, by the Key R&D Projects in Hubei Province, China under Grants No. 2022BAA041 and No. 2021BCA124, and in part by the Wuhan University Specific Fund for Major School-level International Initiatives, China.

References (56)

  • Li, Z., et al. (2021). A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications.

  • Li, W., et al. (2021). JDGAN: Enhancing generator on extremely limited data via joint distribution. Neurocomputing.

  • Maldonado, S., et al. (2022). FW-SMOTE: A feature-weighted oversampling approach for imbalanced classification. Pattern Recognition.

  • Meng, D., et al. (2022). An imbalanced learning method by combining SMOTE with center offset factor. Applied Soft Computing.

  • Mirzaei, B., et al. (2021). CDBH: A clustering and density-based hybrid approach for imbalanced data classification. Expert Systems with Applications.

  • Tao, X., et al. (2022). SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning. Information Sciences.

  • Tsai, C.-F., et al. (2019). Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences.

  • Vuttipittayamongkol, P., et al. (2020). Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Information Sciences.

  • Vuttipittayamongkol, P., et al. (2021). On the class overlap problem in imbalanced data classification. Knowledge-Based Systems.

  • Wang, C., et al. (2021). Adaptive ensemble of classifiers with regularization for imbalanced data classification. Information Fusion.

  • Wang, X., et al. (2021). Local distribution-based adaptive minority oversampling for imbalanced data classification. Neurocomputing.

  • Wei, J., et al. (2020). NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Systems with Applications.

  • Wen, G., et al. (2021). One-step spectral rotation clustering for imbalanced high-dimensional data. Information Processing & Management.

  • Xie, X., et al. (2021). A novel progressively undersampling method based on the density peaks sequence for imbalanced data. Knowledge-Based Systems.

  • Zhang, S. (2020). Cost-sensitive KNN classification. Neurocomputing.

  • Zhang, X., et al. (2017). KRNN: k rare-class nearest neighbour classification. Pattern Recognition.

  • Zhu, Y., et al. (2020). EHSO: Evolutionary hybrid sampling in overlapping scenarios for imbalanced learning. Neurocomputing.

  • Abdi, L., et al. (2015). To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering.
Hongwei Ding and Yu Sun contributed equally to this work.