A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors
Introduction
Class imbalance is an important cause of the performance deterioration of classifiers [1]. In class-imbalance problems, one or more classes (i.e., the minority classes) have very few cases while the other classes (i.e., the majority classes) have large numbers of cases. Such highly skewed class distributions arise in many real-world applications of class-imbalance learning, such as biomedical analysis [2], fraud detection [3], enterprise credit evaluation [4] and image recognition [5].
Prediction accuracy [6] is the metric usually used to evaluate the performance of a trained classifier in machine learning. Although class imbalance impairs the predictive capability of a classifier, the classifier may still achieve very high prediction accuracy: its predictions become biased towards the majority class, and the number of majority class instances is much greater than that of the minority class instances. As a result, prediction accuracy should not be used as the sole evaluation metric for class-imbalance problems. Other metrics, such as the G-mean [7], F-measure [8], Kappa statistic [9] and AUC/ROC [10], are frequently used to complement performance evaluations.
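To make this concrete, the following sketch (our own illustration, not code from the paper; the authors' experiments were coded in MATLAB) computes accuracy, G-mean and F-measure from confusion-matrix counts for a binary problem, treating the minority class as the positive class:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority_label=1):
    """Illustrative binary metrics; the minority class is the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == minority_label) & (y_pred == minority_label))
    fn = np.sum((y_true == minority_label) & (y_pred != minority_label))
    fp = np.sum((y_true != minority_label) & (y_pred == minority_label))
    tn = np.sum((y_true != minority_label) & (y_pred != minority_label))
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity on minority
    specificity = tn / (tn + fp) if tn + fp else 0.0   # accuracy on majority
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = np.sqrt(recall * specificity)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, g_mean, f_measure

# A classifier that predicts only the majority class on a 95:5 data set
# reaches 0.95 accuracy, yet its G-mean and F-measure are both 0:
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
print(imbalance_metrics(y_true, y_pred))   # (0.95, 0.0, 0.0)
```

The demo at the bottom illustrates exactly the failure mode described above: high accuracy despite the minority class being completely missed.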
Common techniques for handling class-imbalance problems fall into three categories: cost-sensitive approaches [11], algorithm-level approaches [12], [13] and data-level approaches [14]. Cost-sensitive approaches use the degree of imbalance to generate cost matrices for misclassification costs. Algorithm-level approaches modify classification algorithms so that they can be adapted to the imbalance issue. Data-level approaches are the most widely used because they are independent of the prediction algorithm: they preprocess the data by rebalancing the majority and minority classes.
Oversampling [15] and undersampling [16], [17] are the two common practices of the data-level approach. Oversampling methods generate new data by replicating or synthesizing important samples of the minority class, while undersampling methods remove redundant samples from the majority class. Among these methods, the synthetic minority oversampling technique (SMOTE) [18] is one of the most popular oversampling methods; it has garnered much praise and has an extensive range of practical applications, such as gender prediction [19], DNA recognition [20], biomedical research [21] and phishing detection [22].
The classical SMOTE and almost all of its extensions [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33] use the k-nearest neighbors (KNN) rule [34] to generate new samples. Specifically, these methods synthesize minority-class samples by using a random difference between a selected base sample and one of its KNNs. Typical improved approaches applying this concept include Borderline-SMOTE [24], Safe-level-SMOTE [25], ADASYN [29], SMOTE-IPF [30], k-means SMOTE [32], etc.
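As a reference point for the discussion below, here is a minimal sketch of the classical SMOTE generation step as described in [18]; the function name and signature are our own, and it assumes the minority set contains more than k samples:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=None):
    """Classic SMOTE sketch: interpolate between a base minority sample
    and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbors because the nearest neighbor of each point is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))               # pick a random base sample
        neighbor = idx[base][rng.integers(1, k + 1)]  # one of its KNNs (skip self)
        gap = rng.random()                            # random interpolation factor
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic
```

Note that k is fixed (typically 5) and every base sample draws from exactly k neighbors; these are the two design choices questioned next.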
In this paper, we focus on SMOTE and its extensions, aiming to solve the two most challenging neighborhood issues:
- (a)
The choice of the k value is an important but unsolved issue [14], [18], [35] in SMOTE and its extensions. Its importance stems from the ability of the parameter k to affect the generalization and faithfulness of synthetic samples [14]. In SMOTE and its extensions, the value of k is usually set to the fixed empirical value 5. In fact, a dynamic and adaptive k value is needed to adapt to data sets with differing complexity (i.e., distribution, size and dimensionality).
- (b)
The second issue is that the number of neighbors should not be the same for each sample. Samples in dense areas (e.g., samples near class centers) should have more neighbors, because we want to increase the generalization of the generated samples. In contrast, samples in sparse areas (e.g., border samples and outliers) should have fewer neighbors, because we want to decrease the error of the generated samples. Fig. 1 provides a toy example to visualize this issue, where SMOTE with k = 5 is used because SMOTE's extensions apply SMOTE in their last step. Consider border sample A as a base sample; its KNNs are samples B-F. If sample A and one of its KNNs (i.e., sample F) are used to generate the new sample L, sample L tends to be noise, since it is located on the fuzzy boundary and its class label differs from those of the samples around it. If sample A had fewer neighbors (e.g., samples B-D), the generated samples would show higher faithfulness. In particular, the number of neighbors for sparse outliers (e.g., samples G and H) should be 0: synthetic samples (i.e., samples I and J) should not be generated from outliers, because outliers distort the distribution of the data. In addition, sample K, generated from sample M and one of its neighbors, is too similar to sample M. We therefore expect samples with high density, such as sample M, to have more neighbors to improve the generalization of the generated samples.
To the best of our knowledge, these issues have not been addressed in related work on SMOTE algorithms so far.
To solve the above issues in SMOTE and its extensions, we propose a new variant of SMOTE by introducing state-of-the-art natural neighbors [36]. The proposed approach is named the synthetic minority oversampling technique with natural neighbors (NaNSMOTE). The principal idea underlying NaNSMOTE is to generate synthetic samples by using the random difference between a selected base sample and one of its natural neighbors. The main advantages of NaNSMOTE are that (a) it has an adaptive k value related to the data complexity; (b) samples of class centers have more neighbors to improve the generalization of synthetic samples, while border samples have fewer neighbors to reduce the error of synthetic samples; and (c) it can remove outliers. The effectiveness of NaNSMOTE is validated by extensive comparative experiments on real data sets. The main contributions of this paper are as follows:
- (a)
NaNSMOTE is proposed. It solves two challenging issues in SMOTE and its extensions, namely, the choice of the parameter k and the determination of the number of neighbors for each sample.
- (b)
We apply the idea of NaNSMOTE to six SMOTE-based oversampling methods [24], [25], [28], [29], [30], [32]. Six new oversampling methods based on NaNSMOTE (i.e., Borderline-NaNSMOTE, Safe-level-NaNSMOTE, NaNADASYN, NaNSMOTE-ENN, NaNSMOTE-IPF and k-means NaNSMOTE) are thus proposed and compared in the experiments.
The rest of the paper is structured as follows. Related work is introduced in Section 2, and related terms and concepts are described in Section 3. Section 4 describes NaNSMOTE, and Section 5 describes the intensive experiments we conducted. Finally, we summarize our work and describe future plans in Section 6.
Related work
SMOTE was first proposed by Chawla et al. [18]. Owing to its simplicity and success, various improved SMOTE algorithms have been developed and applied in practice.
Borderline-SMOTE1 [24], Borderline-SMOTE2 [24] and Safe-level-SMOTE [25] belong to the category of improvements that attach importance to class regions. Specifically, Borderline-SMOTE1 and Borderline-SMOTE2 only generate new samples at the class boundary, since they assume that border samples make the greatest contribution to classification.
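For illustration, the border-sample selection of Borderline-SMOTE [24] can be sketched as follows: a minority sample is marked as "in danger" when at least half, but not all, of its m nearest neighbors belong to the majority class. The names and the handling of self-matches below are our own simplifications, not the authors' code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_samples(X, y, minority_label=1, m=5):
    """Sketch of Borderline-SMOTE's DANGER set [24]: minority samples whose
    m nearest neighbors are mostly, but not entirely, majority samples."""
    y = np.asarray(y)
    # m + 1 neighbors over the whole set; the first neighbor of each query
    # is the query itself (assuming no duplicate points), so we drop it
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    danger = []
    for row, sample_idx in zip(idx, np.flatnonzero(y == minority_label)):
        n_maj = np.sum(y[row[1:]] != minority_label)
        if m / 2 <= n_maj < m:       # mostly majority neighbors: borderline
            danger.append(sample_idx)
    return np.array(danger)
```

Samples whose neighbors are all majority (n_maj == m) are treated as noise and skipped, which is the distinction Safe-level-SMOTE [25] also builds on.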
Preliminaries
In this section, symbols and terms are first listed. Then, the theory of natural neighbors is briefly introduced.
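Since the snippet above only names the concept, the following sketch outlines one common formulation of the natural-neighbor search [36]: the neighborhood radius r grows until every sample has a reverse neighbor, or until the number of samples without one stabilizes; two samples are natural neighbors iff each lies in the other's r-nearest-neighbor list. The termination heuristic and all names below are our own reading of [36], not the authors' implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbors(X, max_r=None):
    """Sketch of natural-neighbor search [36]. Returns the terminal radius r
    (the adaptive 'k') and, per sample, the set of its natural neighbors.
    Samples with an empty set have no natural neighbors, i.e., outliers."""
    n = len(X)
    max_r = max_r or n - 1
    nn = NearestNeighbors(n_neighbors=max_r + 1).fit(X)
    _, idx = nn.kneighbors(X)          # idx[:, 0] is each point itself
    reverse_count = np.zeros(n, dtype=int)
    prev_orphans, repeat = n, 0
    for r in range(1, max_r + 1):
        for i in range(n):
            reverse_count[idx[i, r]] += 1   # i nominates its r-th neighbor
        orphans = int(np.sum(reverse_count == 0))
        # heuristic stop: no orphans left, or orphan count stopped changing
        repeat = repeat + 1 if orphans == prev_orphans else 0
        prev_orphans = orphans
        if orphans == 0 or repeat >= int(np.sqrt(r)):
            break
    # mutual neighbors within the terminal radius r are natural neighbors
    nan_sets = [set() for _ in range(n)]
    for i in range(n):
        for j in idx[i, 1 : r + 1]:
            if i in idx[j, 1 : r + 1]:
                nan_sets[i].add(int(j))
    return r, nan_sets
```

Note how the three properties claimed in the Introduction fall out of this construction: r adapts to the data, dense samples collect many mutual neighbors, and sparse outliers end with none.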
Proposed algorithm
A flowchart of the proposed NaNSMOTE is depicted in Fig. 2. First, the original minority class data Smin and the parameter N are used as inputs. Second, NaNs are calculated, and outliers are filtered out by searching for samples without NaNs. Next, synthetic samples are generated by using the random difference between a selected base sample and one of its NaNs. In particular, the main difference between our algorithm and algorithms based on SMOTE (including SMOTE itself) is that NaNs rather than KNNs are used to generate the synthetic samples.
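Based on that description, a minimal sketch of the generation loop might look as follows. It reuses the natural_neighbors function sketched in the Preliminaries above; this is our illustration of the described flow, not the authors' MATLAB implementation:

```python
import numpy as np

def nan_smote(X_min, n_new, rng=None):
    """Sketch of the NaNSMOTE flow: find natural neighbors on the minority
    class, drop outliers, then interpolate toward a random natural neighbor."""
    rng = np.random.default_rng(rng)
    _, nan_sets = natural_neighbors(X_min)   # from the earlier sketch
    # outliers have no natural neighbors and are never used as base samples
    bases = [i for i, s in enumerate(nan_sets) if s]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = bases[rng.integers(len(bases))]        # select a base sample
        j = rng.choice(sorted(nan_sets[i]))        # one of its natural neighbors
        gap = rng.random()                         # random interpolation factor
        synthetic[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

Compared with the smote sketch earlier, the only change is the neighbor source, which is exactly the point of the algorithm: the number of candidate neighbors now varies per sample and outliers contribute nothing.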
Experimental settings
A server with an Intel(R) Xeon(R) Silver 4100 CPU at 2.10 GHz, 64 GB of memory and a 64-bit Windows 10 operating system was used to run all experiments. MATLAB 2015 was used for coding.
Real data sets were selected from the UCI (University of California, Irvine) repository and used as experimental data. Table 1 describes them in terms of the number of instances, the number of attributes and the imbalance ratio. Multiclass data sets were binarized using the one-versus-rest method: the smallest class was regarded as the minority class, and the remaining classes were merged into the majority class.
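For clarity, this binarization step amounts to the following (a trivial sketch with our own function name):

```python
import numpy as np

def binarize_smallest_vs_rest(y):
    """One-versus-rest binarization as described above: the smallest class
    becomes the minority (label 1); all other classes are merged (label 0)."""
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    return (np.asarray(y) == minority).astype(int)

# The imbalance ratio reported in Table 1 is then majority/minority:
y_bin = binarize_smallest_vs_rest(["a", "b", "b", "c", "c", "c"])
ir = np.sum(y_bin == 0) / np.sum(y_bin == 1)   # 5.0 for this toy label vector
```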
Conclusions and future plans
In this paper, we focused on SMOTE and its extensions, aiming to solve two challenging problems, namely, the choice of the parameter k and the determination of the number of neighbors for each sample. To this end, the synthetic minority oversampling technique with natural neighbors (NaNSMOTE) was presented. The main idea of NaNSMOTE is that synthetic samples are generated by using the random difference between a selected base sample and one of its natural neighbors. The main advantages of NaNSMOTE are that (a) it has an adaptive k value related to the data complexity; (b) samples of class centers have more neighbors to improve the generalization of synthetic samples, while border samples have fewer neighbors to reduce the error of synthetic samples; and (c) it can remove outliers.
CRediT authorship contribution statement
Junnan Li: Conceptualization, Methodology, Software, Writing - review & editing. Qingsheng Zhu: Methodology. Quanwang Wu: Writing - review & editing. Zhu Fan: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61802360), the Chongqing Education and Science Committee project (KJZH17104 and CSTC2017rgun-zdyfx0040) and the Project of Chongqing Natural Science Foundation (cstc2019jcyjmsxmX0683). This work was also supported by the Graduate Scientific Research and Innovation Foundation of Chongqing, China (Grant Nos. CYB20063 and CYB20049).
References (50)
- et al., A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data, Pattern Recogn. (2018)
- et al., Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Inf. Sci. (2020)
- et al., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. (2016)
- et al., Classifying DNA methylation imbalance data in cancer risk prediction using SMOTE and Tomek link methods, International Conference of Pioneering Computer Scientists, Engineers and Educators (2018)
- et al., Weighted-SMOTE: a modification to SMOTE for event classification in sodium cooled fast reactors, Prog. Nucl. Energy (2017)
- et al., Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Inf. Sci. (2018)
- et al., SSOMaj-SMOTE-SSOMin: Three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets, Appl. Soft Comput. (2019)
- et al., Natural neighbor: a self-adaptive neighborhood method without parameter k, Pattern Recogn. Lett. (2016)
- et al., Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection, Appl. Soft Comput. (2014)
- et al., Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: the SMOTE-FRST-2T algorithm, Eng. Appl. Artif. Intell. (2016)
- A hybrid classifier combining Borderline-SMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: a case study in Taiwan, Comput. Methods Programs Biomed.
- SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst.
- A concurrency control algorithm for nearest neighbor query, Inf. Sci.
- Analysis of new techniques to obtain quality training sets, Pattern Recogn. Lett.
- A non-parameter outlier detection algorithm based on natural neighbor, Knowl.-Based Syst.
- A self-training method based on density peaks and an extended parameter-free local noise filter for k nearest neighbor, Knowl.-Based Syst.
- An effective framework based on local cores for self-labeled semi-supervised classification, Knowl.-Based Syst.
- Learning imbalanced datasets based on SMOTE and Gaussian distribution, Inf. Sci.
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance, Lect. Notes Comput. Sci.
- Strategies for tackling the class imbalance problem in marine image classification, Int. Conf. Pattern Recognition
- Semi-supervised self-training method based on an optimum-path forest, IEEE Access
- SVMs modeling for highly imbalanced classification, IEEE Trans. Syst. Man Cybern.
- Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, ACM SIGKDD Explor. Newsl.
- Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms, J. Supercomput.