SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors
Introduction
In recent years, with the rapid growth in the amount of data generated across various fields, machine learning techniques have come to play an unprecedented role [26], [49]. However, some inherent characteristics of data, such as imbalanced class distributions, pose substantial challenges for traditional machine learning techniques.
Learning from imbalanced data is an important topic in machine learning, as it is relevant to a wide range of applications, including disease diagnosis and classification [21], software defect detection [3], credit risk evaluation [41], prediction of actionable revenue change and bankruptcy [29], fault diagnosis in industrial processes [32], soil type classification [37], and even crash injury severity prediction [22] and crime linkage analysis [25]. Class imbalance learning (CIL) is also a challenging task in machine learning, as most supervised learning algorithms are built on the theory of empirical risk minimization. Consequently, traditional learning models tend to favor the majority classes while neglecting performance on the minority classes [5].
In the past two decades, many CIL technologies have been developed to address the class imbalance problem [16]. In general, these CIL techniques can be roughly divided into three categories: data level, algorithmic level, and ensemble learning. At the data level, data are balanced either by increasing the minority class instances (over-sampling) [9], [27], [31], [43], by removing majority class instances (under-sampling) [2], [44], or by both (hybrid sampling) [15]. At the algorithmic level, the data distribution is left untouched; instead, existing machine learning algorithms are modified to adapt to imbalanced data. The main algorithmic-level techniques include cost-sensitive learning [46], threshold-moving strategies [45], and kernel learning [42]. Ensemble learning, which combines a data-level or algorithmic-level algorithm with the Bagging, Boosting, or Random Forest paradigms, can improve the accuracy and robustness of CIL [18], [10], [24], [38], [50]. Compared with the other CIL approaches, data-level methods have two inherent advantages: 1) they are easier to implement, and 2) they are independent of the classification model used. Therefore, data-level algorithms are more popular in practical applications.
At the data level, over-sampling is generally more popular and more widely used than under-sampling. This is because under-sampling discards instances, losing important information contained in the data and thereby degrading modelling quality. Among the over-sampling techniques, the Synthetic Minority Over-sampling TEchnique (SMOTE) [9] is the best known and most popular algorithm because it simultaneously avoids the overfitting problem of traditional random over-sampling (ROS) and the information loss of under-sampling. Although SMOTE is effective, it has an inherent drawback: it tends to propagate noise. Several SMOTE variants have been proposed to address this problem, including SMOTE-TL [2], SMOTE-ENN [15], SMOTE-RSB [34] and SMOTE-IPF [36]. All of these variants adopt the idea of hybrid sampling: an under-sampling procedure is conducted after the SMOTE over-sampling procedure to remove noisy instances.
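The core interpolation step of SMOTE can be sketched in a few lines: each synthetic instance is placed at a random point on the segment between a minority seed and one of its k nearest minority neighbours. The function name and parameters below are illustrative, not taken from any particular implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between seed instances and their k nearest minority neighbours.
    A minimal sketch of the core SMOTE idea, not the full algorithm."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # drop column 0: each point's nearest neighbour is itself
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    seeds = rng.integers(0, len(X_min), n_new)          # random seed instance
    partners = neigh[seeds, rng.integers(0, k, n_new)]  # random neighbour each
    gaps = rng.random((n_new, 1))                       # interpolation factors
    return X_min[seeds] + gaps * (X_min[partners] - X_min[seeds])
```

Because every synthetic point is a convex combination of two real minority points, a noisy minority seed placed inside the majority region will spawn synthetic noise around it, which is exactly the propagation problem discussed above.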
The SMOTE-based hybrid sampling algorithms identify noise in different ways. Both SMOTE-TL [2] and SMOTE-ENN [15] exploit local neighbourhood information to seek out noise. SMOTE-RSB [34] uses the lower-approximation concept from rough set theory to identify and remove synthetic minority noise. SMOTE-IPF [36] adopts an under-sampling ensemble to find noisy instances and removes them iteratively. These solutions alleviate the noise propagation caused by SMOTE to a greater or lesser extent. However, data distributions in real-world applications can be complicated. On data with a complex distribution, e.g., highly imbalanced data or imbalanced data with multiple small disjuncts in the minority class, SMOTE hybrid variants based on neighbourhood calculation are apt to produce highly inaccurate noise estimates, yielding a poor-quality data set for training classification models.
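As a concrete illustration of such local neighbourhood cleaning, an Edited Nearest Neighbours (ENN)-style filter of the kind SMOTE-ENN builds on can be sketched as follows; this is a sketch of the general idea, not the exact procedure of [15]:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_clean(X, y, k=3):
    """ENN-style local cleaning: drop any instance whose k nearest
    neighbours mostly disagree with its own label.  Labels in y are
    assumed to be small non-negative integers."""
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    # the k+1 neighbours include the point itself; drop that self-match
    neigh = knn.kneighbors(X, return_distance=False)[:, 1:]
    votes = y[neigh]
    # majority label among the k true neighbours of each instance
    pred = np.array([np.bincount(v).argmax() for v in votes])
    keep = pred == y
    return X[keep], y[keep]
```

A purely local rule like this is exactly what breaks down on complex distributions: a legitimate instance in a small minority disjunct is surrounded by majority neighbours and gets flagged as noise, while its removal further shrinks the disjunct.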
As indicated above, on complexly distributed data, the information acquired from neighbourhood calculation is inadequate to accurately estimate the location of each instance, leading to incorrect identification of noisy instances. We observe, however, that density information is robust to the data distribution and can help locate noise accurately. Therefore, in this paper, we propose a novel hybrid sampling algorithm, SMOTE-RkNN, which combines reverse k-nearest neighbors (RkNN) [30], [33], [35] with SMOTE. Unlike existing techniques, SMOTE-RkNN identifies noise according to density information acquired in a global fashion. First, the SMOTE procedure is conducted on the original training set to generate a balanced training set. Then, within each class, the number of reverse k-nearest neighbors of each training instance is counted, and the approximate probability density of each instance is calculated from the RkNN results. Next, a normalization procedure proportionally tunes the approximate probability densities to make them comparable across classes. Each training instance is then placed into the opposite class, one by one, to obtain its heterogeneous probability density, and the so-called relevant probability density is computed as the ratio of the heterogeneous probability density to the homogeneous probability density. Finally, training instances whose relevant probability densities exceed a given threshold can be safely considered noise and are removed, yielding a noiseless training set. The proposed SMOTE-RkNN algorithm is compared with SMOTE and several SMOTE hybrid variants on 46 class-imbalanced data sets, where it shows promising results, indicating its effectiveness and superiority.
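The noise-filtering steps above can be sketched as follows. The +1 smoothing term, the pool-size normalization, and the threshold value of 1.0 are illustrative assumptions of this sketch, not the paper's exact formulation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_hits(pool, query, k):
    """Count how many points in `pool` rank `query` among their own k
    nearest neighbours once `query` is added to the pool."""
    aug = np.vstack([pool, query])           # query sits at index len(pool)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(aug)
    idx = nn.kneighbors(pool, return_distance=False)
    hits = 0
    for i, row in enumerate(idx):
        neigh = [j for j in row if j != i][:k]   # drop the point itself
        if len(pool) in neigh:
            hits += 1
    return hits

def relevant_density_filter(X, y, k=3, threshold=1.0):
    """Noise filter in the spirit described above: an instance whose
    approximate density inside the opposite class exceeds its density
    inside its own class is treated as noise and removed."""
    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(y):
        own_idx = np.where(y == c)[0]
        other = X[y != c]
        for j, i in enumerate(own_idx):
            own_pool = np.delete(X[own_idx], j, axis=0)
            homo = rknn_hits(own_pool, X[i], k)      # homogeneous RkNN count
            hetero = rknn_hits(other, X[i], k)       # heterogeneous RkNN count
            # normalize by pool size so the two densities are comparable;
            # +1 smoothing avoids division by zero (an assumed choice)
            homo_d = (homo + 1) / len(own_pool)
            hetero_d = hetero / len(other)
            if hetero_d / homo_d > threshold:
                keep[i] = False
    return X[keep], y[keep]
```

In a hypothetical two-cluster data set, a minority-labelled point sitting deep inside the majority cluster collects many reverse neighbours from the majority class but almost none from its own, so its relevant density ratio is large and it is filtered out, while instances in well-formed regions of either class are retained.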
The remainder of this paper is organized as follows. Section 2 reviews related work in the context of SMOTE and its variants. Section 3 describes the proposed algorithm in detail. In Section 4, the experimental results and the corresponding analysis are presented. Finally, Section 5 concludes this paper and proposes some directions for future work.
Related work
As mentioned above, the hundreds of existing CIL solutions can be divided into three main categories: data level, algorithmic level, and ensemble learning. CIL techniques at the data level are better suited to addressing the CIL problem, as they are easier to implement and are independent of the learning model. Data-level CIL algorithms can be further divided into two groups: under-sampling and over-sampling. In the process of under-sampling, much classification information may be lost.
Methods
Motivated by the problems discussed above, we propose a novel SMOTE hybrid sampling algorithm called SMOTE-RkNN, which integrates the SMOTE and RkNN techniques. Specifically, RkNN is used to estimate the probability density of each instance in a global fashion, which in turn allows noise to be located accurately. Technical details of the SMOTE, RkNN and SMOTE-RkNN algorithms are introduced in the following subsections. The time complexity of SMOTE-RkNN is also discussed.
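As background, the RkNN count itself — the quantity on which the density estimate is built — can be obtained directly from a k-nearest-neighbour query. This brute-force sketch assumes distinct points, so each instance's own entry is always its first neighbour and can be dropped:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_count(X, k):
    """Reverse k-nearest-neighbour count for every instance in X: the
    number of other instances that include it among their own k nearest
    neighbours.  Dense-region points collect many reverse neighbours,
    while isolated or noisy points collect few, so the count serves as
    a simple global density proxy."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self column
    counts = np.zeros(len(X), dtype=int)
    for row in idx:
        counts[row] += 1      # each point votes for its k neighbours
    return counts
```

Unlike the (forward) k-NN list, which always has exactly k entries, the reverse count varies from 0 upward across the data set, which is precisely what makes it informative about the global density landscape.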
Data sets description
We collected 46 binary-class imbalanced data sets to verify the effectiveness and superiority of the proposed algorithm. The collection comprises 11 data sets from the UCI machine learning repository [4], 29 from the Keel data repository [40], two real-world bioinformatics data sets [20], [47], and four real data sets from Kaggle. These data sets have 3–32 attributes, 169–20,000 instances, and class imbalance ratios (IR) ranging from 1.78 to 129.92. A detailed
Conclusions
In this paper, we propose a new SMOTE hybrid sampling variant that combines SMOTE and RkNN to address the CIL problem. The reason for adopting RkNN as an under-sampling tool is that it reflects the data density distribution in a global fashion, thereby providing accurate and robust denoising results. That is, the use of RkNN decreases the risk of failing to identify truly noisy instances. We conducted experiments on 46 class-imbalanced data sets and compared the performance of the proposed SMOTE-RkNN algorithm with that of SMOTE and several SMOTE hybrid variants.
CRediT authorship contribution statement
Aimin Zhang: Conceptualization, Data curation, Investigation, Writing – original draft. Hualong Yu: Resources, Writing – review & editing, Funding acquisition, Supervision. Zhangjun Huan: . Xibei Yang: Writing – review & editing, Funding acquisition. Shang Zheng: Resources, Writing – review & editing. Shang Gao: Formal analysis, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The work was supported in part by the Natural Science Foundation of Jiangsu Province of China under grant No. BK20191457, the Open Project of the Artificial Intelligence Key Laboratory of Sichuan Province under grant No. 2019RYJ02, and the National Natural Science Foundation of China under grants Nos. 62176107, 62076111 and 62076215.
References (50)
- et al., Convolution-based linear discriminant analysis for functional data classification, Inf. Sci. (2021)
- et al., A hybrid data-level ensemble to enable learning from highly imbalanced dataset, Inf. Sci. (2021)
- et al., Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, Inf. Sci. (2010)
- et al., Ensembles of feature selectors for dealing with class-imbalance datasets: A proposal and comparative study, Inf. Sci. (2020)
- et al., Sample imbalance disease classification model based on association rule feature selection, Pattern Recogn. Lett. (2020)
- et al., Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data, Accid. Anal. Prev. (2018)
- Smote-variants: A python implementation of 85 minority oversampling techniques, Neurocomputing (2019)
- et al., Ensembles of cost-diverse Bayesian neural learners for imbalanced binary classification, Inf. Sci. (2020)
- et al., A novel random forest approach for imbalance problem in crime linkage, Knowl.-Based Syst. (2020)
- et al., A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors, Inf. Sci. (2021)
- Optimizing predictive precision in imbalanced datasets for actionable revenue change prediction, Eur. J. Oper. Res.
- Learning imbalanced datasets based on SMOTE and Gaussian distribution, Inf. Sci.
- A novel class imbalance-robust network for bearing fault diagnosis utilizing raw vibration signals, Measurement
- Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets, Pattern Recogn. Lett.
- SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf. Sci.
- Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique, Comput. Electron. Agric.
- GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recogn.
- SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning, Inf. Sci.
- Imbalanced credit risk evaluation based on multiple sampling, multiple kernel fuzzy self-organizing map and local accuracy ensemble, Appl. Soft Comput.
- Cost-sensitive Fuzzy Multiple Kernel Learning for imbalanced problem, Neurocomputing
- A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data, Inf. Sci.
- ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing
- ODOC-ELM: Optimal decision outputs compensation-based extreme learning machine for classifying imbalanced data, Knowl.-Based Syst.
- Class-specific attribute value weighting for Naïve Bayes, Inf. Sci.
- A survey on federated learning, Knowl.-Based Syst.
Cited by (44)
- Two-step ensemble under-sampling algorithm for massive imbalanced data classification, Information Sciences (2024)
- Predicting lodging severity in dry peas using UAS-mounted RGB, LIDAR, and multispectral sensors, Remote Sensing Applications: Society and Environment (2024)
- SMOTE-kTLNN: A hybrid re-sampling method based on SMOTE and a two-layer nearest neighbor classifier, Expert Systems with Applications (2024)
- R-WDLS: An efficient security region oversampling technique based on data distribution, Applied Soft Computing (2024)
- AWGAN: An adaptive weighting GAN approach for oversampling imbalanced datasets, Information Sciences (2024)