2020 Special Issue
CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems
Introduction
The data imbalance problem is a frequent but challenging issue in classification. When the number of data points differs significantly across classes, the accuracy of classifiers can deteriorate (Buda et al., 2018, Japkowicz and Stephen, 2002, Xie and Qiu, 2007). The data imbalance problem arises in numerous fields, including computer vision (Beijbom et al., 2012, Johnson et al., 2013, Kubat et al., 1998, Van Horn et al., 2017, Xiao et al., 2010), medical diagnosis (Grzymala-Busse et al., 2004, Mac Namee et al., 2002, Wang et al., 2019), fault detection (Lee et al., 2017, Liu and Li, 2010, Suh et al., 2019), and others (Graves et al., 2016, Haixiang et al., 2017, Zhao et al., 2008). The performance of machine learning algorithms such as convolutional neural networks (CNNs) deteriorates under imbalanced data conditions because the classification results become biased toward the majority class.
Among the main approaches to dealing with the data imbalance problem, rebalancing the class distribution at the data level, for example by oversampling (Lee et al., 2017, Ramentol et al., 2012, Suh et al., 2019), undersampling (Ng, Hu, Yeung, Yin, & Roli, 2014), or ensemble learning (Lu, Chen, Wu, & Chan, 2015), is a general solution that does not depend on the classifier. Among these, oversampling, which generates artificial data for the minority class, has proven to be the most effective way to handle class imbalance for CNN models in image classification (Buda et al., 2018).
A traditional oversampling approach in the computer vision domain, used to augment the training dataset and reduce overfitting, consists of geometric transformations such as rotation, image cropping, flipping, and color conversions (Krizhevsky et al., 2012, Wong et al., 2016). However, the images generated with these methods are in many cases merely simple and redundant copies of the original data. Moreover, geometric transformations do not improve the data distribution determined by high-level features, as they only apply image-level transformations of depth and scale (Wong et al., 2016). What is needed, therefore, is an oversampling technique that estimates the data distribution and generates new data, rather than merely augmenting the training set.
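For concreteness, a minimal sketch of this kind of image-level augmentation using torchvision is shown below; the particular transforms and parameter values are illustrative assumptions, not the configuration used in this paper.

```python
import torchvision.transforms as T

# Illustrative image-level augmentation: each transform alters geometry or color
# but does not change the high-level feature distribution of the class.
geometric_augmentation = T.Compose([
    T.RandomRotation(degrees=15),                 # rotation
    T.RandomCrop(size=28, padding=2),             # image cropping (28x28, MNIST-sized inputs assumed)
    T.RandomHorizontalFlip(p=0.5),                # flipping
    T.ColorJitter(brightness=0.2, contrast=0.2),  # color conversion
    T.ToTensor(),
])

# Usage: pass as the `transform` argument of a torchvision dataset, e.g.
# datasets.FashionMNIST(root="data", train=True, transform=geometric_augmentation)
```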
The standard oversampling algorithm is the synthetic minority oversampling technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002), which generates new samples for the minority class based on the similarities between the original minority-class samples in the feature space. It is one of the most widely used algorithms for generating synthetic minority-class samples to improve classification performance (Jeatrakul, Wong, & Fung, 2010). However, it can produce noisy results when the boundary between the majority and minority classes is ambiguous (He & Garcia, 2008). For this reason, extensions including borderline SMOTE (Han, Wang, & Mao, 2005) and ADASYN (He, Bai, Garcia, & Li, 2008) have emerged, which attempt to increase classification accuracy by sharpening the boundary between the two classes. However, these oversampling techniques can fail to generate new samples that resemble the original data at first glance yet differ in detail, especially when it is difficult to extract features from the imbalanced dataset in a regularized way. When applied to high-dimensional imbalanced data, such as images and audio, they cannot reduce the classification bias towards the majority class (Lusa et al., 2012). Moreover, the Euclidean distance used in SMOTE is not a suitable metric for measuring the similarity between samples in high-dimensional spaces (Holzinger, 2016): in such spaces, the distance from a sample to its nearest neighbor can become nearly as large as the distance to its farthest neighbor.
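To make the interpolation step concrete, the following is a minimal sketch of the core SMOTE idea, generating synthetic samples on the line segment between a minority sample and one of its k nearest minority-class neighbors; the function name and this simple NumPy/scikit-learn implementation are ours for illustration, not the reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))      # pick a minority sample
        j = rng.choice(idx[i][1:])        # pick one of its k nearest minority neighbors
        lam = rng.random()                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```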
Recently, generative adversarial networks (GANs) (Goodfellow et al., 2014, Schmidhuber, 2020) have emerged as a class of generative models approximating the real data distribution. Conditional GANs (CGAN) (Mirza & Osindero, 2014) and auxiliary classifier GANs (ACGAN) (Odena, Olah, & Shlens, 2017) also extend GANs by conditioning the training procedure on the class labels for the classifier. Douzas and Bacao (2018) demonstrate that the data generated by CGAN improves the performance of the classification.
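As a rough illustration of such label conditioning (a generic sketch, not the specific architecture of CGAN or ACGAN), a conditional generator typically combines a class-label embedding with the noise vector before decoding it into an image:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: the class label is embedded and concatenated
    with the noise vector before being mapped to a flattened image."""
    def __init__(self, z_dim=100, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized images
        )

    def forward(self, z, labels):
        # Concatenate noise and label embedding along the feature dimension.
        return self.net(torch.cat([z, self.label_emb(labels)], dim=1))

# Example: generate 64 fake images conditioned on random class labels.
# g = ConditionalGenerator()
# fake = g(torch.randn(64, 100), torch.randint(0, 10, (64,)))  # shape: (64, 784)
```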
However, GANs and conditional GANs such as CGAN and ACGAN have limitations. The instability of the training process remains a practical challenge for these models. Another problem is the influence of noise: because the generator takes a random noise vector as input and outputs a synthetic image, the classifier used with conditional GANs needs to account for the influence of this noise. In addition, conditional GANs perform well only under the assumption that the class boundaries are clear. In real-world datasets, this assumption often does not hold, because the boundaries between classes are frequently unclear and ambiguous.
Taking these problems into consideration, we propose the classification enhancement generative adversarial network (CEGAN), which generates synthetic minority data to enhance classification under imbalanced data conditions. We adopt the objective formulation of the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) (Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017), which improves training stability and classification performance under imbalanced data conditions (Gao et al., 2019, Suh et al., 2019). CEGAN is composed of three independent networks, a generator, a discriminator, and a classifier, which together generate artificial data that can improve the performance of the classifier. The classifier in CEGAN is the one used for the imbalanced data classification itself, with modifications for reducing the impact of noise.
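Since the full CEGAN objective is not reproduced in this snippet, the sketch below shows only the well-known WGAN-GP gradient penalty term that the adopted objective builds on (Gulrajani et al., 2017); the tensor shapes assume image inputs, and the penalty weight of 10 is the value suggested in the WGAN-GP paper rather than a setting taken from this work.

```python
import torch

def gradient_penalty(discriminator, real, fake):
    """WGAN-GP term: penalize deviations of the discriminator's gradient norm
    from 1 on random interpolations between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mixing factor
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_interp = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_interp, inputs=interp,
                                grad_outputs=torch.ones_like(d_interp),
                                create_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss with penalty weight 10 (value from the WGAN-GP paper); CEGAN's full
# objective additionally involves a separate classifier network, omitted here:
# d_loss = discriminator(fake).mean() - discriminator(real).mean() \
#          + 10.0 * gradient_penalty(discriminator, real, fake)
```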
Secondly, we propose a conditional generative model that considers the relationships between ambiguous classes. Because data augmented under unclear boundary conditions can degrade performance, we formulate a new loss function over multiple subsets of ambiguous class labels. The ambiguous class subsets are extracted from the classification results, and the proposed objective function for reducing this ambiguity takes the form of a summation of loss functions for the classifier in CEGAN.
The main contributions of this paper are summarized as follows: (1) A novel GAN structure containing a classifier, which induces the generated data to carry more features useful for classification. (2) A classifier with the functionality of reducing the impact of the noise input and of the ambiguity between classes. (3) Comprehensive experimental results on various benchmark datasets demonstrating that the proposed method achieves promising performance, in terms of the quantity and quality of data augmentation, compared with many classical and state-of-the-art algorithms.
The paper is organized as follows: Section 2 provides an overview of related work on GANs. Section 3 presents the generative model in detail. In Section 4, we describe the network architectures used for five benchmark datasets, MNIST (LeCun, Cortes, & Burges, 2010), extended MNIST (Cohen, Afshar, Tapson, & van Schaik, 2017), fashion-MNIST, CIFAR-10 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow, Crowley, Antoniou, & Storkey, 2018), and present the experimental results in comparison with ACGAN, SMOTE, support vector machine with SMOTE (SVM–SMOTE) (Nguyen, Cooper, & Kamei, 2011), the majority weighted minority oversampling technique (MWMOTE) (Barua, Islam, Yao, & Murase, 2012), VAE-GANs, BAGAN, and WGAN-GP. Section 5 concludes the paper and discusses future research.
Section snippets
Related work
In this section, we provide a brief summary of GANs, ACGAN, and WGAN-GP.
GANs (Goodfellow et al., 2014, Schmidhuber, 2020) represent a class of generative models based on a game-theoretic scenario in which a generator network competes against an adversary, i.e., a discriminator. GANs aim to approximate the probability distribution function from which the data are assumed to be drawn. The objective function of the min–max game between the generator and the discriminator is expressed as follows:
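The snippet is cut off before the equation; for completeness, the standard GAN min–max objective from Goodfellow et al. (2014), to which the omitted formula corresponds, is:

```latex
\min_{G}\max_{D} V(D,G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```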
CEGAN: Classification Enhancement Generative Adversarial Networks
It is necessary to ensure the performance of the classifier in the training procedure of ACGAN because the discriminator and the generator are trained using the classification loss. However, because the auxiliary classifier in ACGAN shares its network structure and weight parameters with the discriminator, its performance is limited and cannot lead to the generation of high-quality images. In order to generate data for minority classes to improve the performance of the classifier
Datasets and models
For the evaluation of the proposed method, we used five benchmark datasets: MNIST (LeCun et al., 2010), extended MNIST digits (Cohen et al., 2017), fashion-MNIST (Xiao, Rasul, & Vollgraf, 2017), CIFAR-10 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow et al., 2018). All benchmark datasets provide a labeled training set and a test set. Example images of the benchmark datasets are displayed in Fig. 3. The LeNet5 (LeCun et al., 1998) classifier model is used for MNIST, EMNIST, and fashion-MNIST,
Conclusions
In this paper, we proposed a data augmentation method, named CEGAN, which is composed of three independent networks and employs the objective formulation of WGAN-GP for classification under imbalanced data conditions. Because the generator of a GAN takes a random noise vector as input and outputs a synthetic image, we proposed a modified classifier architecture for the generated images that considers the effect of the noise input. We also proposed a conditional generative model with class subsets, which were
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research was supported by the Korea Institute of Science and Technology Europe Institutional Program (Project No. 12020).
References (60)
- Buda et al. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks.
- Douzas & Bacao (2018). Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems with Applications.
- Haixiang et al. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications.
- Mac Namee et al. (2002). The problem of bias in training data in regression problems in medical decision support. Artificial Intelligence in Medicine.
- On the momentum term in gradient descent learning algorithms. Neural Networks (1999).
- Schmidhuber (2020). Generative adversarial networks are special cases of artificial curiosity (1990) and also closely related to predictability minimization (1991). Neural Networks.
- Principal component analysis. Chemometrics and Intelligent Laboratory Systems (1987).
- Xie & Qiu (2007). The effect of imbalanced data sets on LDA: A theoretical and empirical analysis. Pattern Recognition.
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International...
- k-means++: The advantages of careful seeding.
- Barua et al. (2012). MWMOTE–Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering.
- Beijbom et al. (2012). Automated annotation of coral reef survey images.
- Chawla et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
- Cohen et al. (2017). EMNIST: An extension of MNIST to handwritten letters.
- Darlow et al. (2018). CINIC-10 is not ImageNet or CIFAR-10.
- Gao et al. (2019). Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing.
- Goodfellow et al. (2014). Generative adversarial nets.
- Maxout networks.
- Graves et al. (2016). Tree species abundance predictions in a tropical agricultural landscape with a supervised classification model and imbalanced data. Remote Sensing.
- Grzymala-Busse et al. (2004). An approach to imbalanced data sets based on changing rule strength.
- Gulrajani et al. (2017). Improved training of Wasserstein GANs.
- Han, Wang, & Mao (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning.
- He, Bai, Garcia, & Li (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning.
- He & Garcia (2008). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering.
- Holzinger (2016). Machine learning for health informatics: State-of-the-art and future challenges (vol. 9605).
- Japkowicz & Stephen (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis.
- Jeatrakul, Wong, & Fung (2010). Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm.
- Johnson et al. (2013). A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing.
- Adam: A method for stochastic optimization.