Penalized multiple distribution selection method for imbalanced data classification
Introduction
Data classification, which assigns data points to pre-defined categories, is widely applied in many Natural Language Processing (NLP) tasks, such as sentiment analysis, information extraction, and document classification. One common approach combines a feature extractor with a softmax classifier. Recently, neural networks have been widely adopted as feature extractors in NLP tasks [1], [2], [3] and have achieved significant success. However, the sensitivity of the classifier to imbalanced data is still not well studied. In many real scenarios, the number of data samples varies significantly across categories; this phenomenon, called data imbalance, distinctly worsens overall model performance because learning is biased towards the prominent classes. Take the ACE2005 relation extraction dataset [4] as an example: the majority class “none” has 752 times as many instances as the minority class “agent artifact”. When models are trained with a traditional softmax classifier, the majority classes can overwhelm the minority classes and lead to degenerate models. This is because the strong underlying assumption of the softmax distribution, namely that the dependent variable is equivalently categorical [5], is inconsistent with the distributions of many real-world datasets.
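To see why a single softmax distribution degenerates under heavy imbalance, consider a bias-only softmax classifier: under cross-entropy it converges to the empirical class priors, so minority classes are almost never predicted. The sketch below is illustrative only; the class counts are hypothetical, merely echoing the ACE2005-style skew.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical 3-class dataset with a severe skew (counts are illustrative,
# echoing the ACE2005 "none" vs. "agent artifact" imbalance).
counts = np.array([7520.0, 100.0, 10.0])
priors = counts / counts.sum()

# A bias-only softmax classifier trained with cross-entropy converges to the
# class priors, so the minority classes are essentially never predicted.
logits = np.log(priors)        # optimal biases under cross-entropy
probs = softmax(logits)
assert np.allclose(probs, priors)
```

Any richer softmax model inherits the same pull toward the priors whenever its features do not fully separate the classes, which is the bias the paper sets out to remove.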
Most previous methods [6], [7], [8], [9] tackle this problem by either oversampling the minority class or undersampling the majority class to obtain a balanced class distribution. Sampling methods have many advantages, such as low computational complexity, flexibility, and compatibility with the assumption underlying the softmax classifier. Although such approaches perform reasonably well, they still suffer from some defects. Oversampling methods duplicate training samples of minority categories to obtain a relatively balanced label distribution, which inevitably overfits the duplicated data, while undersampling methods randomly remove training instances of majority classes, causing information loss through the reduced training set.
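The two sampling strategies and their respective defects can be sketched on a toy label set (a minimal illustration, not the cited methods' exact procedures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced label set: 90 majority-class ("0") vs. 10 minority-class ("1") samples.
y = np.array([0] * 90 + [1] * 10)
idx_maj, idx_min = np.where(y == 0)[0], np.where(y == 1)[0]

# Random oversampling: duplicate minority indices until the classes match,
# at the risk of overfitting the repeated samples.
over = np.concatenate([idx_maj, rng.choice(idx_min, size=len(idx_maj), replace=True)])

# Random undersampling: keep only as many majority indices as there are
# minority samples, discarding potentially useful training information.
under = np.concatenate([rng.choice(idx_maj, size=len(idx_min), replace=False), idx_min])

assert (y[over] == 0).sum() == (y[over] == 1).sum() == 90
assert (y[under] == 0).sum() == (y[under] == 1).sum() == 10
```

Both resampled index sets are balanced, but the oversampled set contains many repeated minority instances and the undersampled set has thrown away most of the majority class.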
Considering these defects, another family of methods, the cost-sensitive approaches [10], [11], [12], [13], has been proposed; these mitigate the imbalanced training problem by manually assigning lower misclassification costs to the majority class than to the minority. Compared to oversampling and undersampling approaches, this kind of method does not prune the label distribution and can fully utilize all the training data. However, cost-sensitive methods assign lower weights to majority classes over the whole dataset rather than at the instance level, which prevents the model from learning hard cases of the majority classes. Worse still, the classifier is sensitive to the cost weights, making it hard to select values of the cost parameters that achieve optimal performance. Take the following sentiment analysis instance as an example: “I was a little concerned about the touch pad based on reviews, but I have found it fine to work with. Label: Positive”. Classifying this instance into the positive category is challenging because of the two opposing sentiment words “concerned” and “fine”. Yet since this positive instance belongs to the majority class, it is assigned a low cost weight, which hinders the model from learning it well.
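A typical cost-sensitive loss simply scales each sample's log-loss by its class weight. The sketch below uses a hypothetical inverse-frequency weighting; note how the weight depends only on the class, so a hard majority example gets the same small weight as an easy one.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Cost-sensitive loss: each sample's log-loss is scaled by the weight
    of its class, so minority errors cost more than majority errors."""
    per_sample = -np.log(probs[np.arange(len(labels)), labels])
    return np.mean(class_weights[labels] * per_sample)

probs = np.array([[0.9, 0.1],     # predicted class distributions for 2 samples
                  [0.2, 0.8]])
labels = np.array([0, 1])
# Hypothetical inverse-frequency weights: the minority class (1) is weighted up.
w = np.array([0.2, 1.8])
loss = weighted_cross_entropy(probs, labels, w)
```

Because the weight is a dataset-level constant per class, the hard positive example from the paragraph above would still receive the small majority-class weight, which is exactly the limitation the paper highlights.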
The goal of this paper is to propose an effective classifier for imbalanced data classification. Specifically, we propose a novel penalized Multiple Distribution Selection (MDS) classifier (Fig. 1), which employs a softmax distribution and a set of degenerate distributions to compose the label distribution of the dataset. Compared to traditional methods that rely solely on a single softmax distribution, the MDS classifier does not adhere to the equal-categorical assumption underlying the softmax distribution, so it need not prune the dataset, thus avoiding the overfitting and information-loss problems. In addition, adaptive lasso regularization is imposed on the mixing proportions of the composed distributions to determine the weights of the distributions automatically. All distributions are treated as potential distributions to avoid artificial selection, and all classes contribute equally to the training loss, thus avoiding insufficient learning of hard examples. Moreover, we propose a two-stage optimization algorithm to jointly estimate the parameters of the feature extractor and the MDS classifier. We randomly initialize the weights of the distributions in the first training stage; after this stage, we obtain distribution weights that are more accurate than the randomly initialized ones. Using these learned weights as the initial point of the second training stage yields better classification results.
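The core modeling idea can be sketched as follows, under stated assumptions: the label distribution is a mixture of one softmax component and one degenerate (point-mass) component per class, and a lasso-style penalty is placed on the degenerate mixing proportions. This is a simplified illustration, not the paper's exact formulation (which uses the adaptive lasso and a two-stage optimization); variable names here are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_prob(logits, pi):
    """p(y|x) as a mixture: one softmax distribution with weight pi[0] plus
    K degenerate distributions, where component k puts all of its mass on
    class k and carries weight pi[1 + k]."""
    return pi[0] * softmax(logits) + pi[1:]

def penalized_nll(logits, label, pi, lam):
    """Negative log-likelihood plus an L1 (lasso-style) penalty on the
    degenerate mixing proportions, pushing unneeded components to zero."""
    p = mixture_prob(logits, pi)
    return -np.log(p[label]) + lam * np.abs(pi[1:]).sum()

pi = np.array([0.9, 0.05, 0.03, 0.02])   # on the simplex: softmax + 3 degenerate weights
logits = np.array([2.0, 0.5, -1.0])
loss = penalized_nll(logits, label=0, pi=pi, lam=0.1)
```

In the two-stage scheme described above, one would first fit with randomly initialized `pi`, then restart the optimization from the learned `pi` as the initial point.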
To demonstrate the effectiveness of our proposed MDS classifier, we conduct extensive experiments with public datasets on three different NLP tasks, namely Sentiment Analysis, Relation Extraction, and Document Classification. The results show that the proposed multiple distribution selection method outperforms previous approaches. Under a highly imbalanced setting, our method achieves up to a 4.1-point absolute F1 gain over high-performing baselines. Our contributions are fourfold:
- We show that a mixture distribution is superior to the traditional single-softmax assumption under the imbalanced setting, from which we derive a plug-in classifier. The new classifier has low computational complexity and is highly effective for imbalanced data classification.
- We show that a regularization scheme can be used to automatically select the mixing proportions of the potential composed distributions in the MDS model.
- A two-stage training method is proposed to jointly estimate the parameters of the feature extractor and the classifier, by which we achieve better classification performance.
- Experimental studies on three NLP tasks using publicly available datasets show that our proposed framework outperforms the current state-of-the-art results.
Related work
Previous research on imbalanced data classification can be grouped into three categories: (1) sampling approaches, (2) cost-sensitive approaches, and (3) hybrid approaches. This section reviews these three types of methods and then discusses the imbalance problem in the deep learning field, which has attracted much attention recently and has significantly advanced system performance in many areas.
Approach overview
Data classification aims to classify instances into pre-defined categories. Fig. 1 illustrates the overall architecture of our approach.
In NLP tasks, given pre-trained word embeddings, we first map each word to its corresponding distributed representation. On top of that, a CNN feature extractor processes the sentence sequence: it encodes each word and its context within fixed-size windows, after which we apply max-pooling operations and finally use the pooled results as the
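The convolve-then-max-pool step can be sketched in miniature as below. This is a minimal illustration of a CNN-over-text feature extractor, not the paper's exact architecture; the embedding and filter sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_features(embeddings, filters, window=3):
    """Slide each convolution filter over word windows of the sentence,
    apply ReLU, then max-pool over time so every filter yields one feature."""
    n, d = embeddings.shape
    feats = []
    for w in filters:                          # each w has shape (window * d,)
        acts = [w @ embeddings[i:i + window].ravel()
                for i in range(n - window + 1)]
        feats.append(max(0.0, max(acts)))      # ReLU + max-over-time pooling
    return np.array(feats)

sent = rng.normal(size=(7, 5))                 # 7 words, 5-dim embeddings
filters = rng.normal(size=(4, 3 * 5))          # 4 filters over 3-word windows
h = cnn_features(sent, filters)                # fixed-size sentence representation
assert h.shape == (4,)
```

Whatever the sentence length, max-over-time pooling yields one value per filter, giving the fixed-size representation that the classifier consumes.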
Experiments
Our model is evaluated on three datasets (see Table 1), the IMDB dataset, the 20Newsgroups dataset, and the ACE dataset, which correspond to three different tasks: sentiment analysis, document classification, and relation extraction, respectively. In addition, we conduct a set of exploratory experiments, including an ablation study, the effects of the imbalance ratio on classification performance, and an analysis of the learned distribution weights. Since we mainly focus on proposing
Conclusions and future work
Based on the assumption that a single distribution is not enough to model complex data in real scenarios, we propose a multiple distribution selection method to tackle imbalance problems in neural network training. Without pruning the dataset, artificially re-weighting the costs of erroneous predictions, or ensembling models, our proposed framework can effectively model imbalanced data via a composed distribution. To automatically determine which distribution a data point is from, we employ L1
CRediT authorship contribution statement
Ge Shi: Conceptualization, Methodology, Software, Writing - original draft. Chong Feng: Supervision, Funding acquisition. Wenfu Xu: Visualization, Investigation. Lejian Liao: Writing - review & editing. Heyan Huang: Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Key R&D Program of China No. 2017YFB1002101, National Natural Science Foundation of China No. U1636203, and the Joint Advanced Research Foundation of China Electronics Technology Group Corporation No. 6141B08010102.
References (54)
- et al., Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification, Neurocomputing (2018)
- et al., Cost-sensitive linguistic fuzzy rule based classification systems under the mapreduce framework for imbalanced big data, Fuzzy Sets and Systems (2015)
- et al., Cost-sensitive support vector machines, Neurocomputing (2019)
- et al., Boosting weighted ELM for imbalanced learning, Neurocomputing (2014)
- et al., Weighted extreme learning machine for imbalance learning, Neurocomputing (2013)
- et al., Near-bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs, Neural Netw. (2015)
- et al., Imbalanced enterprise credit evaluation with DTE-SBD: decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Inform. Sci. (2018)
- et al., Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting, Inf. Fusion (2020)
- et al., An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. (2018)
- et al., Multi-imbalance: an open-source software for multi-class imbalance learning, Knowl.-Based Syst. (2019)
- Deep learning fault diagnosis method based on global optimization GAN for unbalanced data, Knowl.-Based Syst.
- Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering, Inform. Sci.
- Word sense disambiguation: a comprehensive knowledge exploitation framework, Knowl.-Based Syst.
- Low-rank local tangent space embedding for subspace clustering, Inform. Sci.
- A comparative evaluation of outlier detection algorithms: experiments and analyses, Pattern Recognit.
- Relaxed lasso, Comput. Statist. Data Anal.
- A systematic study of the class imbalance problem in convolutional neural networks, Neural Netw.
- Attention is all you need
- Structured sequence modeling with graph convolutional recurrent networks
- Graph convolution over pruned dependency trees improves relation extraction
- Applied Logistic Regression, vol. 398
- The class imbalance problem in pattern classification and learning
- Cost-sensitive KNN classification, Neurocomputing