A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors
Introduction
Class imbalance is an important cause of the performance deterioration of classifiers [1]. In class-imbalance problems, one or more classes (i.e., the minority classes) have very few cases while the other classes (i.e., the majority classes) have large numbers of cases. Such highly skewed class distributions arise in many real-world applications of class-imbalance learning, such as biomedical analysis [2], fraud detection [3], enterprise credit evaluation [4] and image recognition [5].
Prediction accuracy [6] is the metric usually used to evaluate the performance of a trained classifier in machine learning. Although class imbalance impairs the predictive capability of a classifier, the classifier may still achieve very high prediction accuracy: its predictions become biased towards the majority class, and the number of majority class instances is much greater than that of the minority class instances. As a result, prediction accuracy should not be used as the sole evaluation metric for class-imbalance problems. Other metrics, such as the G-mean [7], F-measure [8], Kappa statistic [9] and AUC/ROC [10], are frequently used to complement performance evaluations.
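To make this concrete, the following sketch (our own illustration, not code from the paper; the authors' experiments were coded in MATLAB) computes accuracy, G-mean and F-measure from confusion-matrix counts for a binary problem, treating the minority class as the positive class:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority_label=1):
    """Illustrative binary metrics; the minority class is the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == minority_label) & (y_pred == minority_label))
    fn = np.sum((y_true == minority_label) & (y_pred != minority_label))
    fp = np.sum((y_true != minority_label) & (y_pred == minority_label))
    tn = np.sum((y_true != minority_label) & (y_pred != minority_label))
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity on minority
    specificity = tn / (tn + fp) if tn + fp else 0.0   # accuracy on majority
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = np.sqrt(recall * specificity)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, g_mean, f_measure

# A classifier that predicts only the majority class on a 95:5 data set
# reaches 0.95 accuracy, yet its G-mean and F-measure are both 0:
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
print(imbalance_metrics(y_true, y_pred))   # (0.95, 0.0, 0.0)
```

The demo at the bottom illustrates exactly the failure mode described above: high accuracy despite the minority class being completely missed.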
Common techniques for handling class-imbalance problems fall into three categories: cost-sensitive approaches [11], algorithm-level approaches [12], [13] and data-level approaches [14]. Cost-sensitive approaches use the degree of imbalance to generate cost matrices for misclassification costs. Algorithm-level approaches modify classification algorithms so that they can be adapted to the imbalance issue. Data-level approaches are the most widely used because they are independent of the prediction algorithm: they preprocess the data by rebalancing the majority and minority classes.
Oversampling [15] and undersampling [16], [17] are the two common practices of the data-level approach. Oversampling methods generate new data by replicating or synthesizing important samples of the minority class, while undersampling methods remove redundant samples from the majority class. Among these methods, the synthetic minority oversampling technique (SMOTE) [18] is one of the most popular oversampling methods; it has garnered much praise and has an extensive range of practical applications, such as gender prediction [19], DNA recognition [20], biomedical research [21] and phishing detection [22].
The classical SMOTE and almost all of its extensions [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33] use the k-nearest neighbors (KNN) rule [34] to generate new samples. Specifically, these methods synthesize minority-class samples by using a random difference between a selected base sample and one of its KNNs. Typical improved approaches applying this concept include Borderline-SMOTE [24], Safe-level-SMOTE [25], ADASYN [29], SMOTE-IPF [30], k-means SMOTE [32], etc.
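As a reference point for the discussion below, here is a minimal sketch of the classical SMOTE generation step as described in [18]; the function name and signature are our own, and it assumes the minority set contains more than k samples:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=None):
    """Classic SMOTE sketch: interpolate between a base minority sample
    and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbors because the nearest neighbor of each point is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))               # pick a random base sample
        neighbor = idx[base][rng.integers(1, k + 1)]  # one of its KNNs (skip self)
        gap = rng.random()                            # random interpolation factor
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic
```

Note that k is fixed (typically 5) and every base sample draws from exactly k neighbors; these are the two design choices questioned next.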
In this paper, we focus on SMOTE and its extensions, aiming to solve the two most challenging neighborhood issues:
- (a)
The choice of the k value is an important but unsolved issue [14], [18], [35] in SMOTE and its extensions. Its importance stems from the ability of the parameter k to affect the generalization and faithfulness of synthetic samples [14]. In SMOTE and its extensions, the value of k is usually set to the fixed empirical value 5. In fact, a dynamic and adaptive k value is needed to adapt to data sets with differing complexity (i.e., distribution, size and dimensionality).
- (b)
The second issue is that the number of neighbors should not be the same for each sample. Samples in dense areas (e.g., samples near class centers) should have more neighbors, because we want to increase the generalization of the generated samples. In contrast, samples in sparse areas (e.g., border samples and outliers) should have fewer neighbors, because we want to decrease the error of the generated samples. Fig. 1 provides a toy example to visualize this issue, where SMOTE with k = 5 is used because SMOTE's extensions apply SMOTE in their last step. Consider border sample A as a base sample; its KNNs are samples B-F. If sample A and one of its KNNs (i.e., sample F) are used to generate the new sample L, sample L tends to be noise, since it is located on the fuzzy boundary and its class label differs from those of the samples around it. If sample A had fewer neighbors (e.g., samples B-D), the generated samples would show higher faithfulness. In particular, the number of neighbors for sparse outliers (e.g., samples G and H) should be 0: synthetic samples (i.e., samples I and J) should not be generated from outliers, because outliers distort the distribution of the data. In addition, sample K, generated from sample M and one of its neighbors, is too similar to sample M. We therefore expect samples with high density, such as sample M, to have more neighbors to improve the generalization of the generated samples.
To the best of our knowledge, these issues have not been addressed in related work on SMOTE algorithms so far.
To solve the above issues in SMOTE and its extensions, we propose a new variant of SMOTE by introducing state-of-the-art natural neighbors [36]. The proposed approach is named the synthetic minority oversampling technique with natural neighbors (NaNSMOTE). The principal idea underlying NaNSMOTE is to generate synthetic samples by using the random difference between a selected base sample and one of its natural neighbors. The main advantages of NaNSMOTE are that (a) it has an adaptive k value related to the data complexity; (b) samples of class centers have more neighbors to improve the generalization of synthetic samples, while border samples have fewer neighbors to reduce the error of synthetic samples; and (c) it can remove outliers. The effectiveness of NaNSMOTE is validated by extensive comparative experiments on real data sets. The main contributions of this paper are as follows:
- (a)
NaNSMOTE is proposed. It solves two challenging issues in SMOTE and its extensions, namely, the choice of the parameter k and the determination of the number of neighbors for each sample.
- (b)
We apply the idea of NaNSMOTE to six SMOTE-based oversampling methods [24], [25], [28], [29], [30], [32]. Six new oversampling methods based on NaNSMOTE (i.e., Borderline-NaNSMOTE, Safe-level-NaNSMOTE, NaNADASYN, NaNSMOTE-ENN, NaNSMOTE-IPF and k-means NaNSMOTE) are thus proposed and compared in the experiments.
The rest of the paper is structured as follows. Related work is introduced in Section 2, and related terms and concepts are described in Section 3. Section 4 describes NaNSMOTE, and Section 5 describes the intensive experiments we conducted. Finally, we summarize our work and describe future plans in Section 6.
Related work
SMOTE was first proposed by Chawla et al. [18]. Owing to its simplicity and success, various improved SMOTE algorithms have been developed and applied in practice.
Borderline-SMOTE1 [24], Borderline-SMOTE2 [24] and Safe-level-SMOTE [25] belong to the category of improvements that attach importance to class regions. Specifically, Borderline-SMOTE1 and Borderline-SMOTE2 only generate new samples at the class boundary, since they assume that border samples make the greatest contribution to classification.
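For illustration, the border-sample selection of Borderline-SMOTE [24] can be sketched as follows: a minority sample is marked as "in danger" when at least half, but not all, of its m nearest neighbors belong to the majority class. The names and the handling of self-matches below are our own simplifications, not the authors' code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_samples(X, y, minority_label=1, m=5):
    """Sketch of Borderline-SMOTE's DANGER set [24]: minority samples whose
    m nearest neighbors are mostly, but not entirely, majority samples."""
    y = np.asarray(y)
    # m + 1 neighbors over the whole set; the first neighbor of each query
    # is the query itself (assuming no duplicate points), so we drop it
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    danger = []
    for row, sample_idx in zip(idx, np.flatnonzero(y == minority_label)):
        n_maj = np.sum(y[row[1:]] != minority_label)
        if m / 2 <= n_maj < m:       # mostly majority neighbors: borderline
            danger.append(sample_idx)
    return np.array(danger)
```

Samples whose neighbors are all majority (n_maj == m) are treated as noise and skipped, which is the distinction Safe-level-SMOTE [25] also builds on.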
Preliminaries
In this section, symbols and terms are first listed. Then, the theory of natural neighbors is briefly introduced.
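Since the snippet above only names the concept, the following sketch outlines one common formulation of the natural-neighbor search [36]: the neighborhood radius r grows until every sample has a reverse neighbor, or until the number of samples without one stabilizes; two samples are natural neighbors iff each lies in the other's r-nearest-neighbor list. The termination heuristic and all names below are our own reading of [36], not the authors' implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbors(X, max_r=None):
    """Sketch of natural-neighbor search [36]. Returns the terminal radius r
    (the adaptive 'k') and, per sample, the set of its natural neighbors.
    Samples with an empty set have no natural neighbors, i.e., outliers."""
    n = len(X)
    max_r = max_r or n - 1
    nn = NearestNeighbors(n_neighbors=max_r + 1).fit(X)
    _, idx = nn.kneighbors(X)          # idx[:, 0] is each point itself
    reverse_count = np.zeros(n, dtype=int)
    prev_orphans, repeat = n, 0
    for r in range(1, max_r + 1):
        for i in range(n):
            reverse_count[idx[i, r]] += 1   # i nominates its r-th neighbor
        orphans = int(np.sum(reverse_count == 0))
        # heuristic stop: no orphans left, or orphan count stopped changing
        repeat = repeat + 1 if orphans == prev_orphans else 0
        prev_orphans = orphans
        if orphans == 0 or repeat >= int(np.sqrt(r)):
            break
    # mutual neighbors within the terminal radius r are natural neighbors
    nan_sets = [set() for _ in range(n)]
    for i in range(n):
        for j in idx[i, 1 : r + 1]:
            if i in idx[j, 1 : r + 1]:
                nan_sets[i].add(int(j))
    return r, nan_sets
```

Note how the three properties claimed in the Introduction fall out of this construction: r adapts to the data, dense samples collect many mutual neighbors, and sparse outliers end with none.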
Proposed algorithm
A flowchart of the proposed NaNSMOTE is depicted in Fig. 2. First, the original minority class data Smin and the parameter N are used as inputs. Second, NaNs are calculated, and outliers are filtered out by searching for samples without NaNs. Next, synthetic samples are generated by using the random difference between a selected base sample and one of its NaNs. In particular, the main difference between our algorithm and algorithms based on SMOTE (including SMOTE itself) is that NaNs rather than KNNs are used to generate the synthetic samples.
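Based on that description, a minimal sketch of the generation loop might look as follows. It reuses the natural_neighbors function sketched in the Preliminaries above; this is our illustration of the described flow, not the authors' MATLAB implementation:

```python
import numpy as np

def nan_smote(X_min, n_new, rng=None):
    """Sketch of the NaNSMOTE flow: find natural neighbors on the minority
    class, drop outliers, then interpolate toward a random natural neighbor."""
    rng = np.random.default_rng(rng)
    _, nan_sets = natural_neighbors(X_min)   # from the earlier sketch
    # outliers have no natural neighbors and are never used as base samples
    bases = [i for i, s in enumerate(nan_sets) if s]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = bases[rng.integers(len(bases))]        # select a base sample
        j = rng.choice(sorted(nan_sets[i]))        # one of its natural neighbors
        gap = rng.random()                         # random interpolation factor
        synthetic[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

Compared with the smote sketch earlier, the only change is the neighbor source, which is exactly the point of the algorithm: the number of candidate neighbors now varies per sample and outliers contribute nothing.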
Experimental settings
A server with an Intel(R) Xeon(R) Silver 4100 CPU at 2.10 GHz, 64 GB of memory and a 64-bit Windows 10 operating system was used to run all experiments. MATLAB 2015 was used for coding.
Real data sets were selected from the UCI (University of California, Irvine) repository and used as experimental data. Table 1 describes them in terms of the number of instances, the number of attributes and the imbalance ratio. Multiclass data sets were binarized using the one-versus-rest method: the smallest class was regarded as the minority class, and the remaining classes were merged into the majority class.
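For clarity, this binarization step amounts to the following (a trivial sketch with our own function name):

```python
import numpy as np

def binarize_smallest_vs_rest(y):
    """One-versus-rest binarization as described above: the smallest class
    becomes the minority (label 1); all other classes are merged (label 0)."""
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    return (np.asarray(y) == minority).astype(int)

# The imbalance ratio reported in Table 1 is then majority/minority:
y_bin = binarize_smallest_vs_rest(["a", "b", "b", "c", "c", "c"])
ir = np.sum(y_bin == 0) / np.sum(y_bin == 1)   # 5.0 for this toy label vector
```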
Conclusions and future plans
In this paper, we focused on SMOTE and its extensions, aiming to solve two challenging problems, namely, the choice of the parameter k and the determination of the number of neighbors for each sample. To this end, the synthetic minority oversampling technique with natural neighbors (NaNSMOTE) was presented. The main idea of NaNSMOTE is that synthetic samples are generated by using the random difference between a selected base sample and one of its natural neighbors. The main advantages of NaNSMOTE are that (a) it has an adaptive k value related to the data complexity; (b) samples of class centers have more neighbors to improve the generalization of synthetic samples, while border samples have fewer neighbors to reduce the error of synthetic samples; and (c) it can remove outliers.
CRediT authorship contribution statement
Junnan Li: Conceptualization, Methodology, Software, Writing - review & editing. Qingsheng Zhu: Methodology. Quanwang Wu: Writing - review & editing. Zhu Fan: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61802360), the Chongqing Education and Science Committee project (KJZH17104 and CSTC2017rgun-zdyfx0040) and the Project of Chongqing Natural Science Foundation (cstc2019jcyjmsxmX0683). This work was also supported by the Graduate Scientific Research and Innovation Foundation of Chongqing, China (Grant Nos. CYB20063 and CYB20049).
References (50)
- et al., A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data, Pattern Recogn. (2018)
- et al., Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Inf. Sci. (2020)
- et al., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. (2016)
- et al., Classifying DNA methylation imbalance data in cancer risk prediction using SMOTE and Tomek link methods, International Conference of Pioneering Computer Scientists, Engineers and Educators (2018)
- et al., Weighted-SMOTE: a modification to SMOTE for event classification in sodium cooled fast reactors, Prog. Nucl. Energy (2017)
- et al., Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Inf. Sci. (2018)
- et al., SSOMaj-SMOTE-SSOMin: Three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets, Appl. Soft Comput. (2019)
- et al., Natural neighbor: a self-adaptive neighborhood method without parameter k, Pattern Recogn. Lett. (2016)
- et al., Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection, Appl. Soft Comput. (2014)
- et al., Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: the SMOTE-FRST-2T algorithm, Eng. Appl. Artif. Intell. (2016)
- A hybrid classifier combining Borderline-SMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: a case study in Taiwan, Comput. Methods Programs Biomed.
- SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst.
- A concurrency control algorithm for nearest neighbor query, Inf. Sci.
- Analysis of new techniques to obtain quality training sets, Pattern Recogn. Lett.
- A non-parameter outlier detection algorithm based on natural neighbor, Knowl.-Based Syst.
- A self-training method based on density peaks and an extended parameter-free local noise filter for k nearest neighbor, Knowl.-Based Syst.
- An effective framework based on local cores for self-labeled semi-supervised classification, Knowl.-Based Syst.
- Learning imbalanced datasets based on SMOTE and Gaussian distribution, Inf. Sci.
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance, Lect. Notes Comput. Sci.
- Strategies for tackling the class imbalance problem in marine image classification, Int. Conf. Pattern Recognition
- Semi-supervised self-training method based on an optimum-path forest, IEEE Access
- SVMs modeling for highly imbalanced classification, IEEE Trans. Syst. Man Cybern.
- Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, ACM SIGKDD Explor. Newsl.
- Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms, J. Supercomput.