Resampling algorithms based on sample concatenation for imbalance learning
Introduction
Class imbalance and class overlap are two important factors that influence the performance of classification models learned on a given dataset. Class imbalance refers to a situation in which the numbers of samples from the different classes are (perhaps extremely) unequal. For a two-class imbalanced dataset, the number of samples from one class (called the majority class) is far larger than that from the other class (called the minority class). Class imbalance may leave the minority class insufficiently represented because of the shortage of minority samples. Class overlap means that there are regions in the sample space where two or more classes have approximately equal prior probabilities; it may lead to an ambiguous classification boundary between classes. Insufficient representation of the minority class and an ambiguous classification boundary between classes can degrade the recognition rate of minority samples. However, correctly predicting minority samples is important in many real applications, such as disease detection [1], [2], financial risk assessment [3], [4], and intrusion detection [5], [6].
Numerous solutions [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25] have been proposed to improve the recognition rate of the minority class without harming the prediction of the majority class. These solutions can be roughly classified into three categories. (1) Resampling methods (also called data-level methods). Resampling methods change the distribution of the classes in a dataset to rebalance it. They directly rebalance imbalanced training data via undersampling [7], [8], [9], [10] or oversampling [11], [12], [13], [14] and can easily be combined with different classifiers because they are independent of the classifier learning process. (2) Algorithm-level methods. Algorithm-level methods include adaptation learning [15], [16] and cost-sensitive learning [17], [18], [19]. Adaptation learning modifies the learning algorithms of specific models to adapt them to imbalanced data, and cost-sensitive learning assigns a higher misclassification cost to the minority class to reduce the learner's bias toward the majority class. (3) Ensemble methods. Ensemble methods [20], [21], [22], [23], [24], [25] combine ensemble learning algorithms with resampling methods and/or algorithm-level methods to deal with class imbalance. Within a given ensemble framework, such as Bagging or Boosting, algorithm-level methods slightly modify the base learner, while resampling methods preprocess the training data before each base classifier is learned.
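As a concrete illustration of the cost-sensitive idea mentioned above, the short sketch below computes inverse-frequency class weights. This particular weighting heuristic is our own illustrative choice (a commonly used "balanced" weighting), not necessarily the scheme adopted in the cited works.

```python
import numpy as np

def inverse_frequency_weights(y):
    """One common cost-sensitive heuristic: weight each class inversely
    proportionally to its frequency, so that errors on the minority
    class cost more during training."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 95 + [1] * 5)          # 95 majority samples, 5 minority samples
print(inverse_frequency_weights(y))       # {0: 0.526..., 1: 10.0}
```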
Resampling methods adopt various techniques to change the distribution of imbalanced data so as to improve classifier predictions, and their effectiveness has been verified [7], [8], [9], [10], [11], [12], [13], [14]. Beyond simple random sampling, two informed resampling strategies, informed undersampling [7], [8], [9] and informed oversampling [26], [27], [28], [29], [30], have been proposed to enhance the representation ability of the minority class or to obtain a clearer classification boundary between classes. Informed undersampling removes some of the majority samples located in the class overlapping region, thereby alleviating the extent to which the minority class is overwhelmed, whereas informed oversampling adds synthetic minority samples in the class overlapping region to enhance the representation ability of the minority class (see Section 2.1 for more detail on informed resampling methods). Although these two strategies can improve the predictive performance on minority samples, they may reduce the performance on the predicted data as a whole, because the first strategy may discard valuable information and the second may make the classification boundary between classes even more ambiguous. Our objective, however, is to improve the recognition rate of minority samples without reducing the classification performance on the entire predicted data.
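To make the two informed strategies concrete, the sketch below gives generic stand-ins rather than the specific methods cited above: an overlap-aware undersampler that drops majority samples whose neighbourhood is dominated by minority samples, and a SMOTE-style oversampler that interpolates between minority neighbours. Both helpers use plain Euclidean neighbourhoods; the cited informed methods rely on more elaborate criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_indices(X, x, k):
    """Indices of the k nearest neighbours of x within X (Euclidean)."""
    d = np.linalg.norm(X - x, axis=1)
    return np.argsort(d)[:k]

def informed_undersample(X_maj, X_min, k=5):
    """Drop majority samples whose neighbourhood is dominated by minority
    samples, i.e. samples that likely lie in the class overlapping region."""
    X_all = np.vstack([X_maj, X_min])
    y_all = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]
    keep = []
    for i, x in enumerate(X_maj):
        nn = knn_indices(X_all, x, k + 1)[1:]        # skip the sample itself
        if y_all[nn].mean() < 0.5:                   # neighbourhood still mostly majority
            keep.append(i)
    return X_maj[keep]

def smote_like_oversample(X_min, n_new, k=5):
    """Create synthetic minority samples by interpolating between a minority
    sample and one of its minority-class neighbours (SMOTE-style)."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        nn = knn_indices(X_min, X_min[i], k + 1)[1:]
        j = rng.choice(nn)
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```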
Currently, most existing resampling methods apply various techniques in the original sample space to cope with information loss and ambiguous classification boundaries. However, it is difficult to address both issues simultaneously because of the class imbalance and class overlap present in the data. This study introduces a data mapping method called sample concatenation and proposes a novel resampling algorithm based on sample concatenation (Re-SC) for imbalance learning. Re-SC merges an original imbalanced dataset and an undersampled version of it into a balanced concatenated dataset in a new sample space. The undersampled dataset consists of a subset of the original majority samples and all of the original minority samples. In the concatenated dataset, both the first half and the second half of the features of a minority sample come from original minority samples, whereas the first half and the second half of the features of a majority sample come from an original majority sample and an important majority sample, respectively. This means that Re-SC preserves the entire distribution of the original minority samples while also taking into account the distribution of the original majority samples and that of the important majority samples, thereby alleviating the loss of valuable samples and reducing the class overlapping region. We derive the relation between the interclass overlapping ratios of the original dataset and of the concatenated dataset and show, via theoretical analysis, that the concatenated dataset has better class separability.
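The snippet below is a minimal sketch of how such a balanced concatenated dataset might be assembled from the description above. The paper's weighting scheme (Eq. (2)) and subset-sizing rule (Eq. (5)) are not reproduced in this preview, so uniform random draws stand in for the selection of "important" majority samples, and the pairing rule and per-class sample count are our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resc_concatenate(X_maj, X_min):
    """Sketch of the concatenation step (pairing rule and class size are assumptions).
    Minority: join two original minority samples.
    Majority: join an original majority sample with an 'important' one
    (here drawn uniformly at random, standing in for the weighted selection)."""
    n = len(X_min)                                             # target size per class
    partners = rng.integers(n, size=n)
    Z_min = np.hstack([X_min, X_min[partners]])                # minority + minority
    first = X_maj[rng.choice(len(X_maj), size=n, replace=False)]
    important = X_maj[rng.choice(len(X_maj), size=n, replace=False)]
    Z_maj = np.hstack([first, important])                      # majority + important majority
    Z = np.vstack([Z_maj, Z_min])                              # 2d-dimensional features
    y = np.r_[np.zeros(n), np.ones(n)]                         # balanced labels
    return Z, y
```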
Further, we observed that there are usually more majority samples available for sample concatenation than Re-SC can use: to obtain a class-balanced concatenated dataset, only a portion of the majority samples can take part in the concatenation. In this case, the classifier trained on the concatenated dataset is usually biased. Thus, to take full advantage of the majority samples, we propose an ensemble resampling algorithm based on sample concatenation (EnRe-SC), which uses the balanced datasets obtained by the Re-SC algorithm as base datasets and an unstable classifier as the base classifier. Compared with a classifier learned on a single dataset obtained by Re-SC, the ensemble classifier learned by EnRe-SC performs better (a sketch of this ensemble scheme is given after the contribution list below). The contributions of this study are as follows:
(1) Sample concatenation is introduced for imbalance learning, and a resampling algorithm based on sample concatenation (Re-SC) is proposed. Re-SC transforms an imbalanced dataset in an original sample space into a concatenated dataset in a new sample space. In the concatenated dataset, the overlapping regions between classes may be reduced and the sample sizes from different classes are approximately the same.
(2) An ensemble resampling algorithm based on sample concatenation (EnRe-SC) for imbalanced data is proposed, which transforms an original imbalanced dataset into multiple balanced datasets in the new sample space, thereby alleviating the issue of information loss and obtaining a classifier with better generalization.
(3) The classification difficulties of the original and concatenated datasets are measured by data complexity, and the effectiveness of the proposed algorithms is verified via experiments on datasets from the UCI and KEEL dataset repositories.
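The ensemble sketch below corresponds to the EnRe-SC idea described before the contribution list: each member is trained on a different balanced concatenated dataset so that, collectively, more of the majority samples are used. It reuses the resc_concatenate sketch given earlier and takes a decision tree as an example of an unstable base learner; the number of members and the majority-vote aggregation are our own illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # an example of an unstable base learner

def enresc_fit(X_maj, X_min, n_members=10):
    """Train one base classifier per balanced concatenated dataset.
    Each call to resc_concatenate draws a different majority subset,
    so the ensemble eventually sees most of the majority samples."""
    members = []
    for _ in range(n_members):
        Z, y = resc_concatenate(X_maj, X_min)      # sketch defined earlier
        members.append(DecisionTreeClassifier().fit(Z, y))
    return members

def enresc_predict(members, Z_test):
    """Majority vote over the base classifiers (aggregation rule assumed).
    Z_test must already be mapped into the concatenated sample space."""
    votes = np.stack([m.predict(Z_test) for m in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```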
Section snippets
Resampling
Undersampling and oversampling are two frequently used resampling strategies. For an imbalanced dataset, undersampling [7], [8], [9], [10] removes some of the samples from the majority class to reduce its overwhelming superiority over the minority class, thus improving the representation ability of the minority class, whereas oversampling [11], [12], [13], [14] adds original and synthetic minority samples to the minority class to improve the representation
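In their simplest, uninformed form, the two strategies amount to random selection with or without replacement. The minimal sketch below shows that baseline; the cited informed variants replace the random choice with data-driven criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_keep):
    """Randomly discard majority samples until only n_keep remain."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

def random_oversample(X_min, n_target):
    """Randomly duplicate minority samples (with replacement) up to n_target."""
    idx = rng.choice(len(X_min), size=n_target, replace=True)
    return X_min[idx]
```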
Motivation
Sample concatenation joins two samples end to end into a new concatenated sample. For two d-dimensional samples x1 and x2, the concatenated sample formed from x1 and x2 is denoted (x1, x2) and has dimensionality 2d. For a given dataset, the corresponding concatenated dataset obtained by sample concatenation differs from the original dataset in sample dimensionality, sample size, and data distribution.
We illustrate sample concatenation using an example. First, a
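The paper's own worked example is truncated in this preview; as a stand-in, the snippet below shows only the basic operation on two arbitrary 3-dimensional vectors.

```python
import numpy as np

# two d-dimensional samples (d = 3 here, values chosen arbitrarily)
x1 = np.array([0.2, 1.5, -0.7])
x2 = np.array([1.0, 0.3,  0.8])

# the concatenated sample (x1, x2) lives in a 2d-dimensional space
z = np.concatenate([x1, x2])
print(z)           # [ 0.2  1.5 -0.7  1.   0.3  0.8]
print(z.shape)     # (6,)
```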
A resampling algorithm based on sample concatenation
For an imbalanced dataset, the resampling algorithm based on sample concatenation transforms a pair of samples from the original dataset into one sample in a new sample space. The idea of the Re-SC algorithm is as follows: each majority sample in the original dataset is first weighted according to Eq. (2) to generate a set of weighted majority samples. Then, the number of samples to be selected is determined by Eq. (5), and the majority subset is obtained by drawing that many samples from the set of weighted majority samples. Subsequently, the concatenated dataset is generated by
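Since Eq. (2) and Eq. (5) are not shown in this preview, the sketch below uses a placeholder weighting, proximity to the minority class, merely to illustrate the "weight, size, then draw" pattern described above; it is not the paper's weighting or sizing rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_important_majority(X_maj, X_min, n_draw):
    """Weighted draw of majority samples (placeholder weighting, not Eq. (2)):
    majority samples closer to the minority class get larger weights."""
    # distance from each majority sample to its nearest minority sample
    d = np.min(np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2), axis=1)
    w = 1.0 / (d + 1e-12)                   # stand-in importance weights
    p = w / w.sum()
    idx = rng.choice(len(X_maj), size=n_draw, replace=False, p=p)
    return X_maj[idx]
```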
Experimental data
To verify the effectiveness of the proposed algorithms, we selected 28 datasets from the UCI machine learning repository [40] (http://archive.ics.uci.edu/ml/index.php) and the KEEL dataset repository [41] (http://www.keel.es/). Some of the selected datasets have two classes, while others have multiple classes. As we focus on two-class classification problems, the multi-class datasets are converted into two-class imbalanced datasets by merging some classes. For instance, for the
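A common way to perform such a conversion is to treat one (or a few) of the original classes as the minority class and merge all remaining classes into the majority class. The helper below is a minimal, hypothetical illustration of that relabelling; the specific class groupings used in the paper are dataset-dependent and not shown in this preview.

```python
import numpy as np

def to_binary(y, minority_labels):
    """Relabel a multi-class label vector: labels in minority_labels become
    the minority class (1); all other classes are merged into the majority (0)."""
    return np.isin(np.asarray(y), list(minority_labels)).astype(int)

y_multi = np.array([0, 1, 2, 2, 3, 1, 0, 2])
print(to_binary(y_multi, {3}))    # [0 0 0 0 1 0 0 0]
```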
Conclusion
We proposed a novel resampling algorithm, Re-SC, that differs from existing resampling methods for learning from imbalanced datasets. Re-SC maps samples from the original sample space into a new sample space, which changes the distribution of the original data to a larger extent. The concatenated training datasets have characteristics that differ from those of the original training datasets. First, the imbalance ratio of the original dataset is larger, while that of the concatenated datasets is
CRediT authorship contribution statement
Hongbo Shi: Conceptualization, Methodology, Writing – review & editing. Ying Zhang: Conceptualization, Software, Validation, Visualization. Yuwen Chen: Conceptualization, Software, Writing – original draft. Suqin Ji: Methodology, Writing – review & editing. Yuanxiang Dong: Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 71701116), the Key Research and Development Program of Shanxi Province, China (No. 201903D121160), the Shanxi Natural Science Foundation, China (No. 201901D111318), and the Humanities and Social Science Fund of the Ministry of Education of China (No. 21YJA630011).
References (51)
- et al., Heart disease detection using deep learning methods from imbalanced ECG samples, Biomed. Signal Process. Control, 2021.
- et al., A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Syst. Appl., 2018.
- et al., Ramp loss K-support vector classification-regression; a robust and sparse multi-class approach to the intrusion detection problem, Knowl.-Based Syst., 2017.
- et al., Fuzziness based semi-supervised learning approach for intrusion detection system, Inform. Sci., 2017.
- et al., Clustering-based undersampling in class-imbalanced data, Inform. Sci., 2017.
- et al., Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Inform. Sci., 2020.
- et al., Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Inform. Sci., 2019.
- et al., Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise, Knowl.-Based Syst., 2020.
- et al., NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Syst. Appl., 2020.
- et al., Preprocessing unbalanced data using support vector machine, Decis. Support Syst., 2012.
- GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recognit.
- EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit.
- SVM-boosting based on Markov resampling: Theory and algorithm, Neural Netw.
- A weighted hybrid ensemble method for classifying imbalanced data, Knowl.-Based Syst.
- Radial-based oversampling for noisy imbalanced data classification, Neurocomputing.
- SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data, Knowl.-Based Syst.
- Coupling different methods for overcoming the class imbalance problem, Neurocomputing.
- A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets, Fuzzy Sets and Systems.
- The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognit.
- SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst.
- Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification, Neurocomputing.
- Heartbeat anomaly detection using adversarial oversampling.
- Credit risk prediction in an imbalanced social lending environment, Int. J. Comput. Intell. Syst.
- LDAMSS: Fast and efficient undersampling method for imbalanced learning, Appl. Intell.
- SMOTE: Synthetic minority over-sampling technique, J. Artificial Intelligence Res.