Resampling algorithms based on sample concatenation for imbalance learning
Introduction
Class imbalance and class overlap are two important factors that influence the performance of classification models learned on a given dataset. Class imbalance refers to a situation in which the numbers of samples from the different classes are (perhaps extremely) unequal. For a two-class imbalanced dataset, the number of samples from one class (called the majority class) is far larger than that from the other class (called the minority class). Class imbalance may leave the minority class insufficiently represented because of the shortage of minority samples. Class overlap means that there are regions in the sample space where two or more classes have approximately equal prior probabilities; it may lead to an ambiguous classification boundary between classes. Insufficient representation of the minority class and an ambiguous classification boundary between classes can degrade the recognition rate of minority samples. However, correctly predicting minority samples is important in many real applications, such as disease detection [1], [2], financial risk assessment [3], [4], and intrusion detection [5], [6].
Numerous solutions [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25] have been proposed to improve the recognition rate of the minority class without harming the prediction of the majority class. These solutions can be roughly classified into three categories. (1) Resampling methods (also called data-level methods). Resampling methods change the distribution of the classes in a dataset to rebalance it. They directly rebalance imbalanced training data via undersampling [7], [8], [9], [10] or oversampling [11], [12], [13], [14] and can easily be combined with different classifiers because they are independent of the classifier learning process. (2) Algorithm-level methods. Algorithm-level methods include adaptation learning [15], [16] and cost-sensitive learning [17], [18], [19]. Adaptation learning modifies the learning algorithms of specific models to adapt them to imbalanced data, and cost-sensitive learning assigns a higher misclassification cost to the minority class to reduce the learner's bias toward the majority class. (3) Ensemble methods. Ensemble methods [20], [21], [22], [23], [24], [25] combine ensemble learning algorithms with resampling methods and/or algorithm-level methods to deal with class imbalance. Within a given ensemble framework, such as Bagging or Boosting, algorithm-level methods slightly modify the base learner, while resampling methods preprocess the training data before each base classifier is learned.
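As a concrete illustration of the cost-sensitive idea mentioned above, the short sketch below computes inverse-frequency class weights. This particular weighting heuristic is our own illustrative choice (a commonly used "balanced" weighting), not necessarily the scheme adopted in the cited works.

```python
import numpy as np

def inverse_frequency_weights(y):
    """One common cost-sensitive heuristic: weight each class inversely
    proportionally to its frequency, so that errors on the minority
    class cost more during training."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 95 + [1] * 5)          # 95 majority samples, 5 minority samples
print(inverse_frequency_weights(y))       # {0: 0.526..., 1: 10.0}
```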
Resampling methods adopt various techniques to change the distribution of imbalanced data so as to improve classifier predictions, and their effectiveness has been verified [7], [8], [9], [10], [11], [12], [13], [14]. Beyond simple random sampling, two informed resampling strategies, informed undersampling [7], [8], [9] and informed oversampling [26], [27], [28], [29], [30], have been proposed to enhance the representation ability of the minority class or to obtain a clearer classification boundary between classes. Informed undersampling removes some of the majority samples located in the class overlapping region, thereby alleviating the extent to which the minority class is overwhelmed, whereas informed oversampling adds synthetic minority samples in the class overlapping region to enhance the representation ability of the minority class (see Section 2.1 for more detail on informed resampling methods). Although these two strategies can improve the predictive performance on minority samples, they may reduce the performance on the predicted data as a whole, because the first strategy may discard valuable information and the second may make the classification boundary between classes even more ambiguous. Our objective, however, is to improve the recognition rate of minority samples without reducing the classification performance on the entire predicted data.
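To make the two informed strategies concrete, the sketch below gives generic stand-ins rather than the specific methods cited above: an overlap-aware undersampler that drops majority samples whose neighbourhood is dominated by minority samples, and a SMOTE-style oversampler that interpolates between minority neighbours. Both helpers use plain Euclidean neighbourhoods; the cited informed methods rely on more elaborate criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_indices(X, x, k):
    """Indices of the k nearest neighbours of x within X (Euclidean)."""
    d = np.linalg.norm(X - x, axis=1)
    return np.argsort(d)[:k]

def informed_undersample(X_maj, X_min, k=5):
    """Drop majority samples whose neighbourhood is dominated by minority
    samples, i.e. samples that likely lie in the class overlapping region."""
    X_all = np.vstack([X_maj, X_min])
    y_all = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]
    keep = []
    for i, x in enumerate(X_maj):
        nn = knn_indices(X_all, x, k + 1)[1:]        # skip the sample itself
        if y_all[nn].mean() < 0.5:                   # neighbourhood still mostly majority
            keep.append(i)
    return X_maj[keep]

def smote_like_oversample(X_min, n_new, k=5):
    """Create synthetic minority samples by interpolating between a minority
    sample and one of its minority-class neighbours (SMOTE-style)."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        nn = knn_indices(X_min, X_min[i], k + 1)[1:]
        j = rng.choice(nn)
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```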
Currently, most existing resampling methods apply various techniques in the original sample space to cope with information loss and ambiguous classification boundaries. However, it is difficult to address both issues simultaneously because of the class imbalance and class overlap present in the data. This study introduces a data mapping method called sample concatenation and proposes a novel resampling algorithm based on sample concatenation (Re-SC) for imbalance learning. Re-SC merges an original imbalanced dataset and an undersampled version of it into a balanced concatenated dataset in a new sample space. The undersampled dataset consists of a subset of the original majority samples and all of the original minority samples. In the concatenated dataset, both the first half and the second half of the features of a minority sample come from original minority samples, whereas the first half and the second half of the features of a majority sample come from an original majority sample and an important majority sample, respectively. This means that Re-SC preserves the entire distribution of the original minority samples while also taking into account the distribution of the original majority samples and that of the important majority samples, thereby alleviating the loss of valuable samples and reducing the class overlapping region. We derive the relation between the interclass overlapping ratios of the original dataset and of the concatenated dataset and show, via theoretical analysis, that the concatenated dataset has better class separability.
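The snippet below is a minimal sketch of how such a balanced concatenated dataset might be assembled from the description above. The paper's weighting scheme (Eq. (2)) and subset-sizing rule (Eq. (5)) are not reproduced in this preview, so uniform random draws stand in for the selection of "important" majority samples, and the pairing rule and per-class sample count are our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resc_concatenate(X_maj, X_min):
    """Sketch of the concatenation step (pairing rule and class size are assumptions).
    Minority: join two original minority samples.
    Majority: join an original majority sample with an 'important' one
    (here drawn uniformly at random, standing in for the weighted selection)."""
    n = len(X_min)                                             # target size per class
    partners = rng.integers(n, size=n)
    Z_min = np.hstack([X_min, X_min[partners]])                # minority + minority
    first = X_maj[rng.choice(len(X_maj), size=n, replace=False)]
    important = X_maj[rng.choice(len(X_maj), size=n, replace=False)]
    Z_maj = np.hstack([first, important])                      # majority + important majority
    Z = np.vstack([Z_maj, Z_min])                              # 2d-dimensional features
    y = np.r_[np.zeros(n), np.ones(n)]                         # balanced labels
    return Z, y
```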
Further, we observed that there are usually more majority samples available for sample concatenation than Re-SC can use: to obtain a class-balanced concatenated dataset, only a portion of the majority samples can take part in the concatenation. In this case, the classifier trained on the concatenated dataset is usually biased. Thus, to take full advantage of the majority samples, we propose an ensemble resampling algorithm based on sample concatenation (EnRe-SC), which uses the balanced datasets obtained by the Re-SC algorithm as base datasets and an unstable classifier as the base classifier. Compared with a classifier learned on a single dataset obtained by Re-SC, the ensemble classifier learned by EnRe-SC performs better (a sketch of this ensemble scheme is given after the contribution list below). The contributions of this study are as follows:
(1) Sample concatenation is introduced for imbalance learning, and a resampling algorithm based on sample concatenation (Re-SC) is proposed. Re-SC transforms an imbalanced dataset in an original sample space into a concatenated dataset in a new sample space. In the concatenated dataset, the overlapping regions between classes may be reduced and the sample sizes from different classes are approximately the same.
(2) An ensemble resampling algorithm based on sample concatenation (EnRe-SC) for imbalanced data is proposed, which transforms an original imbalanced dataset into multiple balanced datasets in the new sample space, thereby alleviating the issue of information loss and obtaining a classifier with better generalization.
(3) The classification difficulties of the original and concatenated datasets are measured by data complexity, and the effectiveness of the proposed algorithms is verified via experiments on datasets from the UCI and KEEL dataset repositories.
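The ensemble sketch below corresponds to the EnRe-SC idea described before the contribution list: each member is trained on a different balanced concatenated dataset so that, collectively, more of the majority samples are used. It reuses the resc_concatenate sketch given earlier and takes a decision tree as an example of an unstable base learner; the number of members and the majority-vote aggregation are our own illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # an example of an unstable base learner

def enresc_fit(X_maj, X_min, n_members=10):
    """Train one base classifier per balanced concatenated dataset.
    Each call to resc_concatenate draws a different majority subset,
    so the ensemble eventually sees most of the majority samples."""
    members = []
    for _ in range(n_members):
        Z, y = resc_concatenate(X_maj, X_min)      # sketch defined earlier
        members.append(DecisionTreeClassifier().fit(Z, y))
    return members

def enresc_predict(members, Z_test):
    """Majority vote over the base classifiers (aggregation rule assumed).
    Z_test must already be mapped into the concatenated sample space."""
    votes = np.stack([m.predict(Z_test) for m in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```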
Section snippets
Resampling
Undersampling and oversampling are two frequently used resampling strategies. For an imbalanced dataset, undersampling [7], [8], [9], [10] removes some of the samples from the majority class to reduce its overwhelming superiority over the minority class, thus improving the representation ability of the minority class, whereas oversampling [11], [12], [13], [14] adds original and synthetic minority samples to the minority class to improve the representation
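In their simplest, uninformed form, the two strategies amount to random selection with or without replacement. The minimal sketch below shows that baseline; the cited informed variants replace the random choice with data-driven criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_keep):
    """Randomly discard majority samples until only n_keep remain."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

def random_oversample(X_min, n_target):
    """Randomly duplicate minority samples (with replacement) up to n_target."""
    idx = rng.choice(len(X_min), size=n_target, replace=True)
    return X_min[idx]
```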
Motivation
Sample concatenation joins two samples end to end into a new concatenated sample. For two d-dimensional samples x1 and x2, the concatenated sample formed from x1 and x2 is denoted (x1, x2) and has dimensionality 2d. For a given dataset, the corresponding concatenated dataset obtained by sample concatenation differs from the original dataset in sample dimensionality, sample size, and data distribution.
We illustrate sample concatenation using an example. First, a
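The paper's own worked example is truncated in this preview; as a stand-in, the snippet below shows only the basic operation on two arbitrary 3-dimensional vectors.

```python
import numpy as np

# two d-dimensional samples (d = 3 here, values chosen arbitrarily)
x1 = np.array([0.2, 1.5, -0.7])
x2 = np.array([1.0, 0.3,  0.8])

# the concatenated sample (x1, x2) lives in a 2d-dimensional space
z = np.concatenate([x1, x2])
print(z)           # [ 0.2  1.5 -0.7  1.   0.3  0.8]
print(z.shape)     # (6,)
```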
A resampling algorithm based on sample concatenation
For an imbalanced dataset, the resampling algorithm based on sample concatenation transforms a pair of samples from the original dataset into one sample in a new sample space. The idea of the Re-SC algorithm is as follows: each majority sample in the original dataset is first weighted according to Eq. (2) to generate a set of weighted majority samples. Then, the number of samples to be selected is determined by Eq. (5), and the majority subset is obtained by drawing that many samples from the set of weighted majority samples. Subsequently, the concatenated dataset is generated by
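Since Eq. (2) and Eq. (5) are not shown in this preview, the sketch below uses a placeholder weighting, proximity to the minority class, merely to illustrate the "weight, size, then draw" pattern described above; it is not the paper's weighting or sizing rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_important_majority(X_maj, X_min, n_draw):
    """Weighted draw of majority samples (placeholder weighting, not Eq. (2)):
    majority samples closer to the minority class get larger weights."""
    # distance from each majority sample to its nearest minority sample
    d = np.min(np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2), axis=1)
    w = 1.0 / (d + 1e-12)                   # stand-in importance weights
    p = w / w.sum()
    idx = rng.choice(len(X_maj), size=n_draw, replace=False, p=p)
    return X_maj[idx]
```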
Experimental data
To verify the effectiveness of the proposed algorithms, we selected 28 datasets from the UCI machine learning repository [40] (http://archive.ics.uci.edu/ml/index.php) and the KEEL dataset repository [41] (http://www.keel.es/). Some of the selected datasets have two classes, while others have multiple classes. As we focus on two-class classification problems, the multi-class datasets are converted into two-class imbalanced datasets by merging some classes. For instance, for the
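A common way to perform such a conversion is to treat one (or a few) of the original classes as the minority class and merge all remaining classes into the majority class. The helper below is a minimal, hypothetical illustration of that relabelling; the specific class groupings used in the paper are dataset-dependent and not shown in this preview.

```python
import numpy as np

def to_binary(y, minority_labels):
    """Relabel a multi-class label vector: labels in minority_labels become
    the minority class (1); all other classes are merged into the majority (0)."""
    return np.isin(np.asarray(y), list(minority_labels)).astype(int)

y_multi = np.array([0, 1, 2, 2, 3, 1, 0, 2])
print(to_binary(y_multi, {3}))    # [0 0 0 0 1 0 0 0]
```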
Conclusion
We proposed a novel resampling algorithm, Re-SC, that differs from existing resampling methods for learning from imbalanced datasets. Re-SC maps samples from the original sample space into a new sample space, which changes the distribution of the original data to a larger extent. The concatenated training datasets have characteristics that differ from those of the original training datasets. First, the imbalance ratio of the original dataset is larger, while that of the concatenated datasets is
CRediT authorship contribution statement
Hongbo Shi: Conceptualization, Methodology, Writing – review & editing. Ying Zhang: Conceptualization, Software, Validation, Visualization. Yuwen Chen: Conceptualization, Software, Writing – original draft. Suqin Ji: Methodology, Writing – review & editing. Yuanxiang Dong: Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 71701116), the Key Research and Development Program of Shanxi Province, China (No. 201903D121160), the Shanxi Natural Science Foundation, China (No. 201901D111318), and the Humanities and Social Science Fund of the Ministry of Education of China (No. 21YJA630011).
References (51)
- et al., Heart disease detection using deep learning methods from imbalanced ECG samples, Biomed. Signal Process. Control, 2021.
- et al., A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Syst. Appl., 2018.
- et al., Ramp loss K-support vector classification-regression; a robust and sparse multi-class approach to the intrusion detection problem, Knowl.-Based Syst., 2017.
- et al., Fuzziness based semi-supervised learning approach for intrusion detection system, Inform. Sci., 2017.
- et al., Clustering-based undersampling in class-imbalanced data, Inform. Sci., 2017.
- et al., Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Inform. Sci., 2020.
- et al., Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Inform. Sci., 2019.
- et al., Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise, Knowl.-Based Syst., 2020.
- et al., NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Syst. Appl., 2020.
- et al., Preprocessing unbalanced data using support vector machine, Decis. Support Syst., 2012.
- GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recognit.
- EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit.
- SVM-boosting based on Markov resampling: Theory and algorithm, Neural Netw.
- A weighted hybrid ensemble method for classifying imbalanced data, Knowl.-Based Syst.
- Radial-based oversampling for noisy imbalanced data classification, Neurocomputing.
- SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data, Knowl.-Based Syst.
- Coupling different methods for overcoming the class imbalance problem, Neurocomputing.
- A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets, Fuzzy Sets and Systems.
- The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognit.
- SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst.
- Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification, Neurocomputing.
- Heartbeat anomaly detection using adversarial oversampling.
- Credit risk prediction in an imbalanced social lending environment, Int. J. Comput. Intell. Syst.
- LDAMSS: Fast and efficient undersampling method for imbalanced learning, Appl. Intell.
- SMOTE: Synthetic minority over-sampling technique, J. Artificial Intelligence Res.