
Information Sciences

Volume 595, May 2022, Pages 70-88

SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors

https://doi.org/10.1016/j.ins.2022.02.038

Abstract

In recent years, class imbalance learning (CIL) has become an important branch of machine learning. The Synthetic Minority Oversampling TEchnique (SMOTE) is considered a benchmark algorithm among CIL techniques. Although the SMOTE algorithm performs well on the vast majority of class-imbalance tasks, it has an inherent drawback: noise propagation. Many SMOTE variants have been proposed to address this problem. Generally, the improved solutions conduct a hybrid sampling procedure, i.e., carrying out an undersampling process after SMOTE to remove noise. However, owing to the complexity of data distribution, it is sometimes difficult to accurately identify real instances of noise, resulting in low modeling quality. In this paper, we propose a more robust and universal SMOTE hybrid variant algorithm named SMOTE-reverse k-nearest neighbors (SMOTE-RkNN). The proposed algorithm identifies noise based on probability density rather than local neighborhood information. Specifically, the probability density information of each instance is provided by RkNN, a well-known KNN variant. Noisy instances are found and deleted according to their relevant probability density. In experiments on 46 class-imbalanced data sets, SMOTE-RkNN showed promising results in comparison with several popular SMOTE hybrid variant algorithms.

Introduction

In recent years, with the rapid increase in the amount of data generated in various fields, machine learning techniques have played an unprecedented role [26], [49]. However, some inherent characteristics of data, such as imbalanced data distribution, pose huge challenges for traditional machine learning techniques.

Learning from imbalanced data is an important topic in machine learning, as it is relevant to a wide range of applications, including diagnosis and classification of diseases [21], detection of software defects [3], evaluation of credit risk [41], prediction of actionable revenue change and bankruptcy [29], fault diagnosis in industrial processes [32], classification of soil types [37], and even prediction of crash injury severity [22] or analysis of crime linkages [25]. Class imbalance learning (CIL) is also a challenging task in machine learning, as most supervised learning algorithms are constructed on the basis of empirical risk minimization. Traditional learning models therefore tend to favor the majority classes while ignoring the performance of the minority classes [5].

In the past two decades, many CIL technologies have been developed to address the class imbalance problem [16]. In general, these CIL techniques can be roughly divided into three categories: data level, algorithmic level, and ensemble learning. At the data level, data are balanced by increasing the minority class instances (over-sampling) [9], [27], [31], [43], removing the majority class instances (under-sampling) [2], [44], or both (hybrid sampling) [15]. At the algorithmic level, the algorithms do not alter the data distribution but instead modify existing machine learning algorithms to adapt to imbalanced data; the main types of methods are cost-sensitive learning [46], threshold moving strategies [45], and kernel learning [42]. Ensemble learning, which combines a data level or algorithmic level algorithm with the Bagging, Boosting or Random Forest paradigms, can improve the accuracy and robustness of CIL [18], [10], [24], [38], [50]. In comparison with other CIL algorithms, data level methods have two inherent advantages: 1) they are more easily implemented, and 2) they are independent of the classification model used. Therefore, data level algorithms are more popular in practical applications.

At the data level, over-sampling is generally more popular and more widely used than under-sampling. This is because the under-sampling process discards some important information contained in the data, which in turn degrades modeling quality. Among the over-sampling techniques, the Synthetic Minority Oversampling TEchnique (SMOTE) [9] is the best known and most popular algorithm because it simultaneously avoids the overfitting problem of traditional random over-sampling (ROS) techniques and the information loss problem of under-sampling techniques. The SMOTE algorithm is effective; however, it has an inherent drawback: it tends to propagate noise. Several SMOTE variants have been proposed to address this problem, including SMOTE-TL [2], SMOTE-ENN [15], SMOTE-RSB [34] and SMOTE-IPF [36]. All these variants adopt the idea of hybrid sampling, i.e., an undersampling procedure is conducted after the SMOTE over-sampling procedure to clean noisy instances.
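
To make the generation rule concrete, SMOTE creates each synthetic instance by interpolating between a minority seed and one of its k nearest minority neighbours. The following minimal sketch illustrates this rule (our illustration, not the authors' code; the function and parameter names are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, random_state=0):
    """Illustrative SMOTE-style interpolation over minority instances X_min."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))   # pick a minority seed
        j = rng.choice(idx[i][1:])     # pick one of its k minority neighbours
        gap = rng.random()             # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Because every synthetic point lies on a segment between two minority instances, a noisy minority seed located inside the majority region drags synthetic points along with it, which is precisely the noise propagation discussed above.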

The hybrid sampling algorithms based on SMOTE identify noise in different ways. SMOTE-TL [2] and SMOTE-ENN [15] both take advantage of local neighbourhood information to seek out noise. SMOTE-RSB [34] uses the lower approximation concept in rough set theory to determine and remove synthetic minority noise. SMOTE-IPF [36] adopts an ensemble-based filter to find noisy instances and removes them iteratively. These solutions can alleviate the noise propagation caused by SMOTE to a greater or lesser extent. However, data distributions in real-world applications are complicated. In the case of data with a complex distribution, e.g., highly imbalanced data or imbalanced data with multiple small minority disjuncts, the SMOTE hybrid variants based on neighbourhood calculation are apt to provide extremely inaccurate noise estimates, resulting in a poor-quality data set for training classification models.
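
As a concrete example of such neighbourhood-based cleaning, the edited nearest neighbours (ENN) rule underlying SMOTE-ENN discards every instance whose label disagrees with the majority vote of its k nearest neighbours. A minimal sketch, assuming integer class labels (the function name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, k=3):
    """Illustrative ENN cleaning: drop instances misclassified by their k-NN.

    y is assumed to hold small non-negative integer labels (e.g., 0/1).
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = []
    for i, neighbours in enumerate(idx):
        votes = y[neighbours[1:]]  # skip self at position 0
        keep.append(np.bincount(votes).argmax() == y[i])
    keep = np.asarray(keep)
    return X[keep], y[keep]
```

Rules of this kind depend entirely on local neighbourhoods, which is why they degrade on highly imbalanced or multi-disjunct data, as argued above.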

As indicated above, on complexly distributed data, the information acquired from neighbourhood calculation can be inadequate for accurately estimating the location of each instance, leading to erroneous identification of noisy instances. We argue that density information is robust to the data distribution and can help locate noise accurately. Therefore, in this paper, we propose a novel hybrid sampling algorithm, SMOTE-RkNN, which combines reverse k-nearest neighbors (RkNN) [30], [33], [35] and SMOTE. Unlike existing techniques, SMOTE-RkNN identifies noise according to density information acquired in a global fashion. First, the SMOTE procedure is conducted on the original training set to generate a balanced training set. Then, within each class, the number of reverse k-nearest neighbors of each training instance is counted. Next, the approximate probability density of each instance is calculated from the RkNN results. A normalization procedure is then conducted to proportionally tune the approximate probability densities so that they are comparable across classes. Next, each training instance is in turn placed into the opposite class to obtain its heterogeneous probability density, and the so-called relevant probability density is acquired as the ratio between the heterogeneous probability density and the homogeneous probability density. Finally, we remove training instances whose relevant probability densities are higher than a given threshold, as these can safely be considered noise, to acquire a noiseless training set. The proposed SMOTE-RkNN algorithm is compared with the SMOTE algorithm and several SMOTE hybrid variants on 46 class imbalanced data sets. SMOTE-RkNN shows promising results, indicating its effectiveness and superiority.
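
A rough sketch of this denoising step, based on our reading of the description above (the exact density estimator, normalization and threshold semantics may differ in the paper; all names here are ours): an instance z counts x as a reverse neighbour when x falls within z's k-th nearest-neighbour radius.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kth_nn_dist(X, k):
    """Distance from each point of X to its k-th nearest neighbour in X (self excluded)."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, k]

def rknn_denoise(X, y, k=5, threshold=1.0):
    """Illustrative RkNN-based cleaning for a binary, already SMOTE-balanced set.

    An instance is removed when its normalized density estimated inside the
    opposite class exceeds the density inside its own class by `threshold`.
    Assumes each class has more than k members.
    """
    classes = np.unique(y)
    radii = {c: kth_nn_dist(X[y == c], k) for c in classes}

    def rknn_count(x, c):
        # Members z of class c count x as a reverse neighbour when
        # dist(z, x) is within z's k-th nearest-neighbour radius.
        d = np.linalg.norm(X[y == c] - x, axis=1)
        return np.sum((d <= radii[c]) & (d > 0))  # d > 0 skips x itself

    keep = []
    for x, c in zip(X, y):
        other = classes[classes != c][0]                        # binary-class setting
        homo = (rknn_count(x, c) + 1e-9) / (y == c).sum()       # own-class density proxy
        hetero = rknn_count(x, other) / (y == other).sum()      # opposite-class proxy
        keep.append(hetero / homo <= threshold)                 # relevant probability density
    keep = np.asarray(keep)
    return X[keep], y[keep]
```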

The remainder of this paper is organized as follows. Section 2 reviews related work in the context of SMOTE and its variants. Section 3 describes the proposed algorithm in detail. In Section 4, the experimental results and the corresponding analysis are presented. Finally, Section 5 concludes this paper and proposes some directions for future work.


Related work

As mentioned above, the hundreds of existing CIL solutions can be divided into three main categories: data level, algorithmic level, and ensemble learning. CIL techniques at the data level are more suitable for addressing the CIL problem, as they are more easily implemented and are independent of the learning model. Data level CIL algorithms can be further divided into two groups: under-sampling and over-sampling. In the process of under-sampling, much classification information may be lost.

Methods

Motivated by the problems discussed above, we propose a novel SMOTE hybrid sampling algorithm called SMOTE-RkNN that integrates the SMOTE and RkNN techniques. Specifically, RkNN is used to estimate the probability density of each instance in a global fashion, which in turn allows noise to be located accurately. Technical details of the SMOTE, RkNN and SMOTE-RkNN algorithms are introduced in the following subsections, and the time complexity of SMOTE-RkNN is also discussed.
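
As a brief illustration of why reverse-neighbour counts carry density information (a toy example of ours, not from the paper): a point in a dense region is a k-nearest neighbour of many other points, whereas an isolated point may have no reverse neighbours at all.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four clustered 1-D points plus one isolated point.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
_, idx = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)  # self + 2 NN

rknn = np.zeros(len(X), dtype=int)
for i, neighbours in enumerate(idx):
    for j in neighbours[1:]:  # i's 2 nearest neighbours (self excluded)
        rknn[j] += 1          # i is a reverse neighbour of j

print(rknn)  # [1 3 4 2 0]: the isolated point has zero reverse neighbours
```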

Data sets description

We collected 46 binary-class imbalanced data sets to verify the effectiveness and superiority of the proposed algorithm. The collection includes 11 data sets from the UCI machine learning repository [4], 29 data sets from the Keel data repository [40], two real-world bioinformatics data sets [20], [47] and four real data sets from Kaggle. These data sets have 3 to 32 attributes, 169 to 20,000 instances, and class imbalance ratios (IR) varying from 1.78 to 129.92. A detailed description of these data sets can be found in the full text.
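
For clarity, IR here denotes the conventional majority-to-minority size ratio; a quick hypothetical example (the counts are ours):

```python
# Class imbalance ratio (IR): majority-class size over minority-class size.
n_majority, n_minority = 1000, 77  # hypothetical class sizes
ir = n_majority / n_minority       # ≈ 12.99; IR = 1 means perfectly balanced
```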

Conclusions

In this paper, we propose a new SMOTE hybrid sampling variant that combines SMOTE and RkNN to address the CIL problem. The reason for adopting RkNN as an undersampling tool is that it can reflect the data density distribution in a global fashion, thereby providing accurate and robust denoising results. That is, the use of RkNN decreases the risk of failing to identify real noisy instances. We conducted experiments on 46 class imbalanced data sets and compared the performance of the proposed SMOTE-RkNN algorithm with that of SMOTE and several popular SMOTE hybrid variants; the results indicate its effectiveness and superiority.

CRediT authorship contribution statement

Aimin Zhang: Conceptualization, Data curation, Investigation, Writing – original draft. Hualong Yu: Resources, Writing – review & editing, Funding acquisition, Supervision. Zhangjun Huan. Xibei Yang: Writing – review & editing, Funding acquisition. Shang Zheng: Resources, Writing – review & editing. Shang Gao: Formal analysis, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work was supported in part by the Natural Science Foundation of Jiangsu Province of China under grant No. BK20191457, the Open Project of Artificial Intelligence Key Laboratory of Sichuan Province under grant No. 2019RYJ02, and the National Natural Science Foundation of China under grants No. 62176107, No. 62076111 and No. 62076215.

References (50)

  • P.D. Mahajan et al., Optimizing predictive precision in imbalanced datasets for actionable revenue change prediction, Eur. J. Oper. Res. (2020)
  • T. Pan et al., Learning imbalanced datasets based on SMOTE and Gaussian distribution, Inf. Sci. (2020)
  • W. Qian et al., A novel class imbalance-robust network for bearing fault diagnosis utilizing raw vibration signals, Measurement (2020)
  • P. Sadhukhan et al., Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets, Pattern Recogn. Lett. (2019)
  • J.A. Sáez et al., SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf. Sci. (2015)
  • A. Sharififar et al., Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique, Comput. Electron. Agric. (2019)
  • B. Tang et al., GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recogn. (2017)
  • X. Tao et al., SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning, Inf. Sci. (2022)
  • L. Wang et al., Imbalanced credit risk evaluation based on multiple sampling multiple kernel fuzzy self-organizing map and local accuracy ensemble, Appl. Soft Comput. (2020)
  • Z. Wang et al., Cost-sensitive Fuzzy Multiple Kernel Learning for imbalanced problem, Neurocomputing (2019)
  • Z. Xu et al., A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data, Inf. Sci. (2021)
  • H. Yu et al., ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing (2013)
  • H. Yu et al., ODOC-ELM: Optimal decision outputs compensation-based extreme learning machine for classifying imbalanced data, Knowl.-Based Syst. (2016)
  • H. Zhang et al., Class-specific attribute value weighting for Naïve Bayes, Inf. Sci. (2020)
  • C. Zhang et al., A survey on federated learning, Knowl.-Based Syst. (2021)