Penalized multiple distribution selection method for imbalanced data classification

https://doi.org/10.1016/j.knosys.2020.105833

Abstract

In reality, the amount of data in different categories often varies significantly, which biases learning towards the prominent classes and hinders overall classification performance. In this paper, by proving that traditional classification methods that use a single softmax distribution are limited in modeling complex and imbalanced data, we propose a general Multiple Distribution Selection (MDS) method for imbalanced data classification. MDS employs a mixture distribution, composed of a single softmax distribution and a set of degenerate distributions, to model imbalanced data. Furthermore, a dynamic distribution selection method based on L1 regularization is proposed to automatically determine the weights of the distributions, and a corresponding two-stage optimization algorithm is designed to estimate the model parameters. Extensive experiments on three widely used benchmark datasets (IMDB, ACE2005, 20NewsGroups) show that the proposed mixture method outperforms previous methods. Moreover, under a highly imbalanced setting, our method achieves up to a 4.1-point absolute F1 gain over high-performing baselines.

Introduction

Data classification, which assigns data points to pre-defined categories, is widely applied in many Natural Language Processing (NLP) tasks, such as sentiment analysis, information extraction, and document classification. One common approach consists of a feature extractor followed by a softmax classifier. Recently, neural networks have been widely adopted as feature extractors in NLP tasks [1], [2], [3] and have achieved significant success. However, the sensitivity of the classifier to imbalanced data is still not well studied. In many real scenarios, the number of samples varies significantly across categories; this phenomenon, called data imbalance, distinctly worsens overall model performance due to the learning bias towards prominent classes. Take the ACE2005 relation extraction dataset [4] as an example: the majority class “none” has 752 times as many samples as the minority class “agent artifact”. When models are trained with a traditional softmax classifier, the majority classes can overwhelm the minority classes and lead to degenerate models. This is because the strong underlying assumption of the softmax distribution, namely that the dependent variable y is equivalently categorical [5], is inconsistent with the distributions of many real-world datasets.

Most previous methods [6], [7], [8], [9] tackle this problem by either oversampling the minority class or undersampling the majority class to obtain a balanced class distribution. Sampling methods have many advantages, such as low computational complexity, flexibility, and compatibility with the assumption underlying the softmax classifier. Although such approaches perform reasonably well, they still suffer from some defects. Oversampling methods duplicate training samples of minority categories to obtain a relatively balanced label distribution, which inevitably leads to overfitting, while undersampling methods randomly remove training instances of majority classes, causing information loss through the discarded instances.
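To make the two baseline strategies concrete, the following is a minimal NumPy sketch of random over- and undersampling (our own illustration, not code from the paper or from [6], [7], [8], [9]); the comments note the defect each variant carries.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the
    largest class size (risks overfitting on the duplicated points)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(y == c)[0]
        idx.extend(members)
        idx.extend(rng.choice(members, size=target - n, replace=True))
    idx = np.array(idx)
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """Drop majority-class samples down to the smallest class size
    (loses the information carried by the removed instances)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = []
    for c in classes:
        members = np.where(y == c)[0]
        idx.extend(rng.choice(members, size=target, replace=False))
    idx = np.array(idx)
    return X[idx], y[idx]
```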

Considering these defects, another line of methods, the cost-sensitive approaches [10], [11], [12], [13], mitigates the imbalanced training problem by manually assigning lower misclassification costs to the majority classes than to the minority ones. Compared to oversampling and undersampling, this kind of method does not prune the label distribution and can fully utilize all the training data. However, cost-sensitive methods assign lower weights to majority classes over the whole dataset rather than at the instance level, which prevents the model from learning hard cases of the majority classes. Worse still, the classifier is sensitive to the cost weights, making it hard to select values of the cost parameters that achieve optimal performance. Take the following sentiment analysis instance as an example: “I was a little concerned about the touch pad based on reviews, but I have found it fine to work with. Label: Positive”. It is a big challenge for a model to classify this instance into the positive category because of the two opposing sentiment words “concerned” and “fine”. Yet, since the instance belongs to the majority (positive) class, it will be assigned a low cost weight, which hinders the model from learning it well.
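For reference, one common cost-sensitive formulation simply weights the cross-entropy loss inversely to class frequency. A minimal PyTorch sketch with hypothetical class counts (this illustrates the general scheme, not the specific cost assignments of [10], [11], [12], [13]):

```python
import torch
import torch.nn as nn

# Hypothetical class counts for a 3-class imbalanced dataset.
class_counts = torch.tensor([9000., 900., 100.])

# Inverse-frequency weights: majority classes get a lower
# misclassification cost, minority classes a higher one.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)         # batch of 8 examples, 3 classes
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)   # each example's loss is scaled by
                                   # the weight of its gold class
```

Because the weight depends only on the class, a hard majority-class example such as the review above receives the same low cost as an easy one, which is exactly the instance-level blindness criticized here.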

The goal of this paper is to propose an effective classifier for imbalanced data classification. Specifically, we propose a novel penalized Multiple Distribution Selection (MDS) classifier (Fig. 1), which employs a softmax distribution together with a set of degenerate distributions to compose the label distribution of the dataset. Compared to traditional methods that rely solely on a single softmax distribution, the MDS classifier does not adhere to the equivalently categorical assumption underlying the softmax distribution, so it need not prune the dataset, thus avoiding the overfitting and information loss problems. In addition, adaptive lasso regularization is imposed on the mixing proportions of the composed distributions to automatically determine their weights. All distributions are treated as potential components, avoiding artificial selection, and all classes contribute equally to the training loss, avoiding insufficient learning of hard examples. Moreover, we propose a two-stage optimization algorithm to jointly estimate the parameters of the feature extractor and the MDS classifier: the distribution weights are randomly initialized in the first stage; after the first stage, the learned weights, which are more accurate than the randomly initialized ones, serve as the initial point of the second stage, yielding better classification results.
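The sketch below shows one plausible parameterization of such a mixture, based on our reading of the description above: one softmax component plus one degenerate (one-hot) component per class, with nonnegative mixing weights that are normalized on use and penalized in L1. The paper's exact formulation (including the adaptive lasso weighting) is given in the method section; names such as MDSClassifier and the choice of relu for nonnegativity are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDSClassifier(nn.Module):
    """Sketch of a penalized multiple-distribution classifier: the label
    distribution is a mixture of one softmax component and K degenerate
    (one-hot) components, one per class, with learned mixing weights."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)
        # Raw (unnormalized) mixing weights: one for the softmax
        # component plus one per degenerate component.
        self.raw_mix = nn.Parameter(torch.ones(num_classes + 1))

    def mixing_proportions(self):
        w = F.relu(self.raw_mix)        # keep weights nonnegative
        return w / (w.sum() + 1e-12)    # normalize onto the simplex

    def forward(self, h):               # h: (batch, hidden_dim)
        pi = self.mixing_proportions()
        soft = F.softmax(self.linear(h), dim=-1)
        # Degenerate component k puts all its mass on class k, so the
        # mixture reduces to pi[0] * softmax plus the degenerate weights.
        return pi[0] * soft + pi[1:]    # p(y | x), rows sum to 1

    def l1_penalty(self, lam=1e-3):
        # Lasso-style penalty on the raw weights; with the relu above it
        # can drive unneeded components exactly to zero.
        return lam * F.relu(self.raw_mix).sum()
```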

To demonstrate the effectiveness of the proposed MDS classifier, we conduct extensive experiments with public datasets on three different NLP tasks, namely Sentiment Analysis, Relation Extraction, and Document Classification. The results show that the proposed multiple distribution selection method outperforms previous approaches. Under a highly imbalanced setting, our method achieves up to a 4.1-point absolute F1 gain over high-performing baselines. Our contributions are fourfold:

  • We show that the mixture distribution is superior to the traditional single-softmax assumption under an imbalanced setting, and deduce a plug-in classifier based on it. The new classifier has low computational complexity and is highly effective for imbalanced data classification.

  • We show that the L1 regularization scheme can be used to automatically select the mixing proportion of potential composed distributions in the MDS model.

  • A two-stage training method is proposed to jointly estimate the parameters of the feature extractor and the classifier, by which we achieve better classification performance (see the sketch after this list).

  • Experimental studies on three NLP tasks using publicly available datasets show that our proposed framework outperforms current state-of-the-art methods.
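The following sketch shows how the two-stage procedure could be wired up around the MDSClassifier sketched earlier (again our own illustration under the stated assumptions; encoder stands for an arbitrary feature extractor, and epoch counts and learning rates are placeholders):

```python
import torch

def nll_loss(probs, y):
    # Negative log-likelihood of the mixture probabilities.
    return -probs[torch.arange(len(y)), y].clamp_min(1e-12).log().mean()

def train_stage(encoder, model, loader, epochs, lr=1e-3):
    """Jointly optimize feature extractor and MDS classifier."""
    params = list(encoder.parameters()) + list(model.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            probs = model(encoder(x))
            loss = nll_loss(probs, y) + model.l1_penalty()
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: train from randomly initialized mixing weights.
#   train_stage(encoder, model, loader, epochs=5)
# Stage 2: restart training with the learned mixing weights as the
# new initial point (other parameters re-initialized):
#   model2 = MDSClassifier(hidden_dim, num_classes)
#   with torch.no_grad():
#       model2.raw_mix.copy_(model.raw_mix)
#   train_stage(encoder, model2, loader, epochs=5)
```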

Related work

Previous research on imbalanced data classification falls into three categories: (1) sampling approaches, (2) cost-sensitive approaches, and (3) hybrid approaches. This section reviews these three types of methods and then discusses the imbalance problem in deep learning, which has recently attracted much attention and significantly advanced system performance in many areas.

Approach overview

Data classification aims to classify instances into pre-defined categories. Fig. 1 illustrates the overall architecture of our approach.

In NLP tasks, given pre-trained word embeddings, we first map each word to its corresponding distributed representation. On top of that, a CNN feature extractor processes the sentence: it encodes each word together with its context within fixed-size windows, after which max-pooling operations are applied and the pooled results are used as the sentence representation.
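A minimal PyTorch sketch of such a CNN feature extractor (window sizes, filter counts, and the class name CNNEncoder are illustrative assumptions, not the paper's hyperparameters):

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Embed each word, convolve over fixed-size context windows,
    then max-pool over time into one sentence representation."""
    def __init__(self, embeddings, window_sizes=(3, 4, 5), filters=100):
        super().__init__()
        # Initialize from pre-trained word embeddings (vocab, dim).
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=False)
        dim = embeddings.size(1)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, filters, kernel_size=w, padding=w // 2)
            for w in window_sizes
        )

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
        # Max-pool each feature map over time, then concatenate.
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)  # (batch, filters * len(windows))
```

The output of this encoder would then be fed to the MDS classifier in place of the usual softmax layer.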

Experiments

Our model is evaluated on three datasets (see Table 1): the IMDB dataset, the 20Newsgroups dataset, and the ACE dataset, which correspond to three different tasks: sentiment analysis, document classification, and relation extraction, respectively. In addition, we conduct a set of exploratory experiments, including an ablation study, an analysis of the effect of the imbalance ratio on classification performance, and an analysis of the learned distribution weights.

Conclusions and future work

Based on the assumption that a single distribution is not enough to model complex data in real scenarios, we propose a multiple distribution selection method to tackle imbalance problems in neural network training. Without pruning the dataset, artificially re-weighting the costs of error predictions, or assembling models, the proposed framework can effectively model imbalanced data via a composed distribution. To automatically determine which distribution a data point comes from, we employ L1 regularization on the mixing proportions.

CRediT authorship contribution statement

Ge Shi: Conceptualization, Methodology, Software, Writing - original draft. Chong Feng: Supervision, Funding acquisition. Wenfu Xu: Visualization, Investigation. Lejian Liao: Writing - review & editing. Heyan Huang: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Key R&D Program of China No. 2017YFB1002101, National Natural Science Foundation of China No. U1636203, and the Joint Advanced Research Foundation of China Electronics Technology Group Corporation No. 6141B08010102.

References (54)

  • Zhou F. et al., Deep learning fault diagnosis method based on global optimization GAN for unbalanced data, Knowl.-Based Syst. (2020)
  • Esposito M. et al., Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering, Inform. Sci. (2020)
  • Wang Y. et al., Word sense disambiguation: a comprehensive knowledge exploitation framework, Knowl.-Based Syst. (2020)
  • Deng T. et al., Low-rank local tangent space embedding for subspace clustering, Inform. Sci. (2020)
  • Domingues R. et al., A comparative evaluation of outlier detection algorithms: experiments and analyses, Pattern Recognit. (2018)
  • Meinshausen N., Relaxed lasso, Comput. Statist. Data Anal. (2007)
  • Buda M. et al., A systematic study of the class imbalance problem in convolutional neural networks, Neural Netw. (2018)
  • Vaswani A. et al., Attention is all you need
  • Seo Y. et al., Structured sequence modeling with graph convolutional recurrent networks
  • Zhang Y. et al., Graph convolution over pruned dependency trees improves relation extraction (2018)
  • C. Walker, S. Strassel, J. Medero, K. Maeda, ACE 2005 multilingual training corpus, Linguistic Data Consortium,...
  • Hosmer Jr. D.W. et al., Applied Logistic Regression, vol. 398 (2013)
  • Garcia V. et al., The class imbalance problem in pattern classification and learning
  • T. Zhang, A. Subburathinam, G. Shi, L. Huang, D. Lu, X. Pan, M. Li, B. Zhang, Q. Wang, S. Whitehead, et al., Gaia-a...
  • L. Huang, H. Ji, J. May, Cross-lingual multi-level adversarial transfer to enhance low-resource name tagging, in:...
  • S. Maliah, G. Shani, MDP-based cost sensitive classification using decision trees, in: Thirty-Second AAAI Conference on...
  • Zhang S., Cost-sensitive KNN classification, Neurocomputing (2019)