Neural Networks

Volume 133, January 2021, Pages 69-86

2020 Special Issue
CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems

https://doi.org/10.1016/j.neunet.2020.10.004

Highlights

  • Classification enhancement generative adversarial networks (CEGAN) are introduced to improve classification under imbalanced data conditions.

  • The proposed method is composed of three independent networks: a generator, a discriminator, and a classifier.

  • By designing a loss function for ambiguous classes, we propose a classification enhancement GAN for ambiguity reduction (CEGAN-AR).

  • The proposed method outperforms various standard data augmentation methods under imbalanced data conditions.

Abstract

The data imbalance problem in classification is frequent but challenging. In real-world datasets, many class distributions are imbalanced, and classification under such conditions is extremely biased toward the majority class. Recently, the potential of GANs as a data augmentation method for minority data has been studied. In this paper, we propose classification enhancement generative adversarial networks (CEGAN) to enhance the quality of generated synthetic minority data and, more importantly, to improve prediction accuracy under imbalanced data conditions. In addition, we propose an ambiguity reduction method that uses the generated synthetic minority data for the case of multiple similar classes that degrade classification accuracy. The proposed method is demonstrated on five benchmark datasets. The results indicate that approximating the real data distribution using CEGAN significantly improves classification performance under imbalanced data conditions compared with various standard data augmentation methods.

Introduction

The data imbalance problem in classification is frequent but challenging. When the number of data points differs significantly across classes, the accuracy of classifiers can deteriorate (Buda et al., 2018, Japkowicz and Stephen, 2002, Xie and Qiu, 2007). The data imbalance problem arises in numerous fields, including computer vision (Beijbom et al., 2012, Johnson et al., 2013, Kubat et al., 1998, Van Horn et al., 2017, Xiao et al., 2010), medical diagnosis (Grzymala-Busse et al., 2004, Mac Namee et al., 2002, Wang et al., 2019), fault detection (Lee et al., 2017, Liu and Li, 2010, Suh et al., 2019), and others (Graves et al., 2016, Haixiang et al., 2017, Zhao et al., 2008). The performance of machine learning algorithms, such as convolutional neural networks (CNNs), deteriorates under imbalanced data conditions, as the classification results are biased toward the majority class.

Among the main approaches to the data imbalance problem, rebalancing the class distribution at the data level, such as oversampling (Lee et al., 2017, Ramentol et al., 2012, Suh et al., 2019), undersampling (Ng, Hu, Yeung, Yin, & Roli, 2014), and ensemble learning (Lu, Chen, Wu, & Chan, 2015), is a general solution that is independent of the classifier. Among these, oversampling, which generates artificial data for the minority class, has proven to be the most effective way to handle class imbalance for CNN models on image classification (Buda et al., 2018).

A traditional oversampling method in the computer vision domain, used to augment the training dataset and reduce overfitting, consists of geometric transformations such as rotation, image cropping, flipping, and color conversion (Krizhevsky et al., 2012, Wong et al., 2016). However, the images generated by these methods are, in many cases, merely simple and redundant copies of the original data. Additionally, geometric transformations do not improve the data distribution determined by high-level features, as they only apply image-level transformations of depth and scale (Wong et al., 2016). Thus, an oversampling technique is needed that estimates the data distribution and generates new data, rather than merely augmenting the training set.
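The geometric transformations described above are purely image-level operations. As a minimal sketch (not the paper's method, and using numpy arrays rather than any specific image library), such an augmentation step might look like:

```python
import numpy as np

def augment(img):
    """Return simple geometric variants of a 2-D image array:
    a horizontal flip, a 90-degree rotation, and a centre crop
    zero-padded back to the original size."""
    flipped = img[:, ::-1]                 # horizontal flip
    rotated = np.rot90(img)                # 90-degree rotation
    h, w = img.shape
    crop = img[h // 8: h - h // 8, w // 8: w - w // 8]
    pad_h, pad_w = h - crop.shape[0], w - crop.shape[1]
    # pad the crop back to the original size with zeros
    cropped = np.pad(crop, ((pad_h // 2, pad_h - pad_h // 2),
                            (pad_w // 2, pad_w - pad_w // 2)))
    return flipped, rotated, cropped
```

Note that every output is a deterministic re-arrangement of the input pixels, which is exactly why such transformations cannot change the high-level feature distribution.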

The standard oversampling algorithm is the synthetic minority oversampling technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002), which generates new samples for the minority class based on the similarities between the original minority class samples in the feature space. It is one of the most widely used algorithms for generating synthetic minority class samples to improve classification performance (Jeatrakul, Wong, & Fung, 2010). However, it can produce noisy results when the boundary between the majority and minority classes is ambiguous (He & Garcia, 2008). For this reason, extensions including borderline SMOTE (Han, Wang, & Mao, 2005) and ADASYN (He, Bai, Garcia, & Li, 2008) have emerged, attempting to increase classification accuracy by sharpening the boundary between the two classes. However, these oversampling techniques can fail to generate new samples that resemble the original data at first glance yet differ in detail, especially when extracting features in a regularized way from the imbalanced dataset is difficult. When applied to high-dimensional imbalanced data, such as images and audio, they cannot reduce the classification bias toward the majority class (Lusa et al., 2012). Moreover, the Euclidean distance used in SMOTE is not a suitable metric for measuring similarity between samples in high-dimensional spaces (Holzinger, 2016). In certain cases, the Euclidean distance between a target sample and its nearest neighbor is larger than the distance between that sample and its furthest neighbor.
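The core of SMOTE is linear interpolation between a minority sample and one of its k nearest minority-class neighbours. A minimal numpy sketch of this idea (the function name and parameters are illustrative, not from Chawla et al.):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbours
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                          # pick a minority sample
        j = nn[i, rng.integers(min(k, n - 1))]       # one of its neighbours
        lam = rng.random()                           # interpolation factor
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because every synthetic point lies on a segment between two existing minority points, the Euclidean-distance issues noted above carry over directly: in high dimensions the "nearest" neighbours chosen here may not be meaningfully similar.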

Recently, generative adversarial networks (GANs) (Goodfellow et al., 2014, Schmidhuber, 2020) have emerged as a class of generative models that approximate the real data distribution. Conditional GANs (CGAN) (Mirza & Osindero, 2014) and auxiliary classifier GANs (ACGAN) (Odena, Olah, & Shlens, 2017) extend GANs by conditioning the training procedure on class labels. Douzas and Bacao (2018) demonstrate that data generated by CGAN improves classification performance.

However, GANs and conditional GANs, such as CGAN and ACGAN, have limitations. The instability of the training process remains a practical challenge for these models. Another problem is the influence of noise: as the generator takes a random noise vector as input and outputs a synthetic image, the classifier used with conditional GANs must account for this noise. In addition, conditional GANs perform well only under the assumption that class boundaries are clear. In real-world datasets, this assumption often does not hold, because the boundaries between classes are unclear and ambiguous.

Taking these problems into consideration, we propose classification enhancement generative adversarial networks (CEGAN), which generate synthetic minority data to enhance classification under imbalanced data conditions. We adopt the objective formulation of the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) (Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017), which improves the stability of training and the performance of classification under imbalanced data conditions (Gao et al., 2019, Suh et al., 2019). CEGAN is composed of three independent networks, a generator, a discriminator, and a classifier, which together generate artificial data that improves the performance of the classifier. The classifier in CEGAN is adapted from the classifier used for imbalanced data classification, with modifications to reduce the impact of noise.
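For reference, the WGAN-GP critic objective adopted here, as given by Gulrajani et al. (2017), is:

```latex
L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\!\left[D(\tilde{x})\right]
  - \mathbb{E}_{x \sim \mathbb{P}_r}\!\left[D(x)\right]
  + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\!\left[
      \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2
    \right]
```

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated data distributions, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the gradient-penalty coefficient.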

Second, we propose a conditional generative model that considers the relationships between ambiguous classes. As data augmented under unclear boundary conditions can degrade performance, we formulate a new loss function over multiple subsets of ambiguous class labels. The ambiguous class subsets are extracted from the classification results, and we propose a new objective function to reduce ambiguity, formulated as a summation of loss functions for the classifier in CEGAN.
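The exact objective is defined in Section 3. Purely as one illustrative reading of "a summation of loss functions" over ambiguous class subsets (the function and its renormalization scheme below are hypothetical, not the paper's definition), the idea can be sketched as a standard cross-entropy plus per-subset cross-entropy terms:

```python
import numpy as np

def subset_ce(probs, labels, subsets):
    """Hypothetical sketch: total loss = global cross-entropy plus a
    cross-entropy term computed over probabilities renormalized within
    each subset of mutually ambiguous classes.
    `subsets` is a list of class-index lists, e.g. [[3, 5], [4, 9]]."""
    eps = 1e-12
    n = len(labels)
    # base cross-entropy over all classes
    loss = -np.mean(np.log(probs[np.arange(n), labels] + eps))
    for s in subsets:
        mask = np.isin(labels, s)      # samples labelled inside this subset
        if not mask.any():
            continue
        p = probs[mask][:, s]          # restrict to the ambiguous classes
        p = p / (p.sum(axis=1, keepdims=True) + eps)   # renormalize
        idx = np.array([s.index(l) for l in labels[mask]])
        loss += -np.mean(np.log(p[np.arange(len(idx)), idx] + eps))
    return loss
```

The extra terms penalize confusion only among the classes inside each subset, which matches the stated goal of sharpening ambiguous boundaries without altering the loss for well-separated classes.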

The main contributions of this paper are summarized as follows: (1) a novel GAN structure containing a classifier, which induces the generated data to carry more features useful for classification; (2) a classifier that reduces the impact of the noise input and the ambiguity between classes; (3) comprehensive experimental results on various benchmark datasets demonstrating that the proposed method achieves promising performance, in terms of both quantity and quality of data augmentation, compared with many classical and state-of-the-art algorithms.

The paper is organized as follows: Section 2 gives an overview of related work on GANs. Section 3 presents the generative model in detail. Section 4 describes the network architectures and five benchmark datasets, MNIST (LeCun, Cortes, & Burges, 2010), extended MNIST (Cohen, Afshar, Tapson, & van Schaik, 2017), fashion-MNIST, CIFAR-10 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow, Crowley, Antoniou, & Storkey, 2018), and presents the experimental results compared with ACGAN, SMOTE, support vector machine with SMOTE (SVM–SMOTE) (Nguyen, Cooper, & Kamei, 2011), the majority weighted minority oversampling technique (MWMOTE) (Barua, Islam, Yao, & Murase, 2012), VAE-GANs, BAGAN, and WGAN-GP. Section 5 concludes the paper and discusses future research.

Related work

In this section, we provide a brief summary of the GAN, ACGAN, and WGAN-GP.

GANs (Goodfellow et al., 2014, Schmidhuber, 2020) represent a class of generative models based on a game-theoretic scenario in which a generator network G competes against an adversary, a discriminator network D. GANs aim to approximate the probability distribution from which the data is assumed to be drawn. The objective function of the min–max game between the generator and the discriminator is expressed as follows:
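The min–max objective from Goodfellow et al. (2014) reads:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z(z)}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

where $D(x)$ is the discriminator's estimated probability that $x$ is a real sample and $G(z)$ maps a noise vector $z \sim p_z$ to a synthetic sample.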

CEGAN: Classification Enhancement Generative Adversarial Networks

It is necessary to ensure the performance of the classifier during the training procedure of ACGAN, because the discriminator and the generator are trained using the classification loss. However, because the auxiliary classifier in ACGAN shares its network structure and weight parameters with the discriminator, its performance cannot be guaranteed, which prevents ACGAN from generating high-quality images. In order to generate data for minority classes to improve the performance of the classifier

Datasets and models

For the evaluation of the proposed method, we used five benchmark datasets: MNIST (LeCun et al., 2010), extended MNIST digits (Cohen et al., 2017), fashion-MNIST (Xiao, Rasul, & Vollgraf, 2017), CIFAR-10 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow et al., 2018). All benchmark datasets provide labeled training and test sets. Example images from the benchmark datasets are displayed in Fig. 3. The LeNet5 (LeCun et al., 1998) classifier model is used for MNIST, EMNIST, and fashion-MNIST,

Conclusions

In this paper, we proposed a data augmentation method, named CEGAN, composed of three independent networks and employing the objective formulation of WGAN-GP, for classification under imbalanced data conditions. As the generator of a GAN takes a random noise vector as input and outputs a synthetic image, we proposed a modified classifier architecture that accounts for the effect of the noise input on generated images. We also proposed a conditional generative model with class subsets, which were

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was supported by Korea Institute of Science and Technology Europe Institutional Program (Project No. 12020).

References (60)

  • Barua, S., et al. (2012). MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering.
  • Beijbom, O., et al. Automated annotation of coral reef survey images.
  • Chawla, N. V., et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
  • Cohen, G., et al. (2017). EMNIST: An extension of MNIST to handwritten letters.
  • Darlow, L. N., et al. (2018). CINIC-10 is not ImageNet or CIFAR-10.
  • Gao, X., et al. (2019). Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing.
  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In...
  • Goodfellow, I., et al. Generative adversarial nets.
  • Goodfellow, I. J., et al. (2013). Maxout networks.
  • Graves, S., et al. (2016). Tree species abundance predictions in a tropical agricultural landscape with a supervised classification model and imbalanced data. Remote Sensing.
  • Grzymala-Busse, J. W., et al. An approach to imbalanced data sets based on changing rule strength.
  • Gulrajani, I., et al. Improved training of Wasserstein GANs.
  • Han, H., et al. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning.
  • He, H., et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning.
  • He, H., et al. (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering.
  • Holzinger, A. (2016). Machine learning for health informatics: State-of-the-art and future challenges (Vol. 9605).
  • Japkowicz, N., et al. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis.
  • Jeatrakul, P., et al. Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm.
  • Johnson, B. A., et al. (2013). A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing.
  • Kingma, D. P., et al. (2014). Adam: A method for stochastic optimization.