SMOTE-NaN-DE: Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution

https://doi.org/10.1016/j.knosys.2021.107056

Highlights

  • A novel oversampling method based on natural neighbors and differential evolution.

  • A novel error detection based on natural neighbors.

  • Differential evolution is modified to optimize the detected noisy and borderline samples.

  • The proposed algorithm is applied to 4 oversampling methods and outperforms 9 SMOTE-based methods.

Abstract

Learning a classifier from class-imbalanced data is an important challenge. Among existing solutions, SMOTE is one of the most successful methods and has an extensive range of practical applications. However, the performance of SMOTE and its extensions usually degrades owing to noisy and borderline examples. Filtering-based methods have been developed to address this problem but still have the following technical defects: (a) error detection techniques heavily rely on parameter settings; (b) examples detected by error detection techniques are directly eliminated, which deviates the obtained decision boundary and makes the classes imbalanced again. To advance the state of the art, a novel filtering-based oversampling method called SMOTE-NaN-DE is proposed in this paper. In SMOTE-NaN-DE, a SMOTE-based method is first used to generate synthetic samples and augment the original class-imbalanced data. Secondly, an error detection technique based on natural neighbors is used to detect noisy and borderline examples. Thirdly, differential evolution (DE) is used to iteratively optimize the position (attributes) of the detected examples instead of eliminating them. The main advantages of SMOTE-NaN-DE are that (a) it can improve almost all SMOTE-based methods with respect to the noise problem; (b) its error detection technique is parameter-free; (c) examples found by the error detection technique are optimized by differential evolution rather than removed, which keeps the imbalance ratio and improves the boundary; (d) it is more suitable for data sets with more noise (especially class noise). The effectiveness of the proposed SMOTE-NaN-DE is validated by intensive comparison experiments on artificial and real data sets.

Introduction

Learning a classifier from class-imbalanced data is an important challenge in fields such as biomedical science [1], text classification [2], fraud prediction [3], intrusion detection [4] and image recognition [5]. In class-imbalance problems, one or more classes (i.e., minority classes) have very few examples, while the other classes (i.e., majority classes) have many. However, the minority classes are often the most interesting from the application point of view and tend to be misclassified.

Class-imbalance classification has been intensively researched over the last decade. Methods to address it can be divided into three categories [6]: the cost-sensitive approach [7], the algorithm-level approach [8] and the data-level approach [9]. The data-level approach is the most widely used because it is independent of the classifier. More specifically, it comprises oversampling [10] and undersampling methods [11].

The Synthetic Minority Over-sampling Technique (SMOTE) [12] is one of the best-known oversampling methods. It generates new artificial minority class examples by interpolating between minority class examples that lie close together. SMOTE has been highly successful and enjoys an extensive range of practical applications, such as gender prediction [13], DNA recognition [14], biomedical research [15] and phishing detection [16].
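The interpolation at the core of SMOTE can be illustrated with a minimal sketch (the function name and the injectable random source are illustrative conveniences, not from the original paper):

```python
import random

def smote_interpolate(x_i, x_nn, rng=random.random):
    """Return one synthetic sample on the line segment between a minority
    example x_i and one of its minority-class nearest neighbors x_nn."""
    gap = rng()  # uniform gap in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x_i, x_nn)]
```

Because the synthetic point always lies between two minority examples, interpolating with a neighbor that is itself noise is exactly how noise can be reinforced, as discussed below.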

Recent work [17], [18], [19], [20] shows that the performance degradation of SMOTE and its extensions is usually associated with noisy and borderline examples. This is because SMOTE-based methods cannot handle noise adequately and can even reinforce it: (a) the introduced instances might result from interpolation between noise and borderline examples corrupted by noise [19]; (b) the boundaries between the classes are disrupted, so the overlap between them increases [20].

This paper considers noise in the wider sense of [17], [18]. Noisy examples are examples corrupted either in their attribute values or in their class label [21]. Borderline examples are examples located either very close to the decision boundary between the minority and majority classes or in the surrounding area where the classes overlap. Attribute noise can move borderline examples to the wrong side of the decision boundary. Fig. 1 uses toy data to visualize noisy and borderline examples, where the two classes are shown in different colors.

Change-direction and filtering-based methods have been proposed to address the problem of noisy and borderline examples. Change-direction methods guide the generation of synthetic examples performed by SMOTE towards specific parts of the input space, e.g., Borderline-SMOTE [22], Safe-Level-SMOTE [23], ADASYN [24] and Adaptive-SMOTE [25]. In filtering-based methods, error detection techniques are used to find noisy and corrupted borderline examples, which are then eliminated. SMOTE-TL [17], SMOTE-ENN [17] and SMOTE-IPF [20] are representative filtering-based methods; they use Tomek Links (TL) [26], the Edited Nearest Neighbor rule (ENN) [27] and the Iterative-Partitioning Filter (IPF) [28], respectively, to detect noisy and corrupted borderline examples. Although recent experimental studies, e.g., [20], confirm the superiority of filtering-based methods, they still have unnoticed but serious shortcomings:

Error detection techniques heavily rely on parameter settings, leading to unstable performance and difficulty in application. SMOTE-ENN relies on the parameter k of ENN. SMOTE-IPF relies on the parameters of IPF, e.g., the number of iterations k, the number of ensemble classifiers n and the parameters of the base classifier.

Existing filtering-based methods directly remove the examples detected by error detection techniques, which may eliminate a large number of synthetic and/or original samples. Owing to this loss, the decision boundary of the preprocessed data may deviate from the real decision boundary, and the class proportion becomes imbalanced again. In Fig. 2, ADASYN, a SMOTE-based method, is used to generate synthetic samples because ADASYN may generate more synthetic noise. IPF, the cleaning method used in the latest SMOTE-IPF, is used to detect noisy and corrupted borderline examples in Fig. 2(c). As Fig. 2(c) shows, many synthetic and/or original examples are eliminated, causing the obtained decision boundary to deviate and the classes to become imbalanced again.
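To make the first defect concrete, the ENN filter underlying SMOTE-ENN can be sketched as follows (a brute-force illustration of Wilson's rule, not the paper's code; note that the flagged set depends directly on the choice of k, which is exactly the parameter sensitivity criticized above):

```python
import math
from collections import Counter

def enn_flags(X, y, k=3):
    """Edited Nearest Neighbor rule: flag every example whose class label
    disagrees with the majority label of its k nearest neighbors
    (O(n^2) brute-force distance search, for clarity)."""
    n = len(X)
    flags = []
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        majority = Counter(y[j] for j in nbrs).most_common(1)[0][0]
        flags.append(y[i] != majority)
    return flags
```

A filtering-based method would delete every flagged example, which is precisely the removal step that SMOTE-NaN-DE replaces with optimization.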

To advance the state of the art, a novel filtering-based oversampling method called SMOTE-NaN-DE is proposed in this paper. In SMOTE-NaN-DE, a SMOTE-based method is first used to generate synthetic samples and augment the original class-imbalanced data. Secondly, the natural neighbor [29] is introduced as an error detection technique to detect noisy and borderline examples. Thirdly, differential evolution (DE) [30] is used to iteratively optimize the position (attributes) of the detected examples instead of eliminating them. The main advantages of SMOTE-NaN-DE are that (a) SMOTE-NaN-DE is a wrapping algorithm and can improve almost all SMOTE-based algorithms with respect to the noise problem; (b) its error detection technique is parameter-free; (c) examples found by the error detection technique are optimized by differential evolution rather than removed, which keeps the imbalance ratio and improves the boundary; (d) it is more suitable for data sets with more noise (especially class noise). The effectiveness of SMOTE-NaN-DE is validated by intensive comparative experiments on artificial and real data sets. The main contributions of this paper are highlighted as follows:

SMOTE-NaN-DE is proposed. It can address the problem of noisy and borderline examples in SMOTE-based methods. Also, it can overcome the mentioned technical defects of existing filtering-based methods.

In SMOTE-NaN-DE, a parameter-free error detection technique based on natural neighbors is proposed to detect noisy and borderline examples in the class-imbalance classification. Compared to the error detection techniques [27], [28] in existing filtering-based methods [17], [20], the proposed error detection technique does not need to set parameters.

Existing filtering-based methods simply remove the noisy examples found by error detection techniques. In SMOTE-NaN-DE, differential evolution is modified to optimize the attributes of the examples found by the error detection technique, which keeps the imbalance ratio and improves the boundary.
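The paper modifies DE for this purpose; for reference, the classic DE/rand/1/bin operator it builds on can be sketched as follows (the function name and the simplification of drawing all three donors from the whole population are illustrative, not the authors' exact variant):

```python
import random

def de_rand_1_bin(target, population, F=0.5, CR=0.9, rng=random):
    """One DE/rand/1/bin step: mutate as r1 + F*(r2 - r3) from three
    distinct donor vectors, then binomially cross over with the target."""
    r1, r2, r3 = rng.sample(population, 3)
    d = len(target)
    j_rand = rng.randrange(d)  # guarantees at least one mutated attribute
    trial = []
    for j in range(d):
        if j == j_rand or rng.random() < CR:
            trial.append(r1[j] + F * (r2[j] - r3[j]))  # mutated attribute
        else:
            trial.append(target[j])  # inherited from the target
    return trial
```

In SMOTE-NaN-DE the trial vector moves a detected noisy or borderline example to a better position rather than producing a new population member, so no sample is deleted and the imbalance ratio is preserved.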

SMOTE-NaN-DE is a wrapping algorithm and can improve almost all SMOTE-based methods. Hence, SMOTE-NaN-DE is applied to 4 oversampling methods based on SMOTE (SMOTE [12], ADASYN [24], k-means SMOTE [31], Adaptive-SMOTE [25]). Besides, they are compared in experiments with 3 test classifiers.

The rest of the paper is structured as follows. Related work is introduced in Section 2. Related terms and concepts are described in Section 3. Section 4 introduces SMOTE-NaN-DE, and intensive experiments are carried out in Section 5. Finally, Section 6 summarizes our work and outlines future work.

Section snippets

Related work

SMOTE was first proposed by Chawla et al. [12] to handle class-imbalance classification. Owing to its wrapper-style design, SMOTE has been used to help other classifiers adapt to class-imbalance data. To adapt boosting to imbalanced classification, SMOTEBoost [32] applies SMOTE in each iteration of the boosting method. In KSMOTE [33], SMOTE is used to help a support vector machine adapt to class-imbalance data. Besides, SMOTECSELM [34] combines ELM with SMOTE

Preliminaries

The main symbols and terms are introduced first. Then the theory of natural neighbors is briefly stated, because the proposed error detection method is based on it.
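As a rough illustration of the natural-neighbor idea (a simplified reading, not the exact search in [29], whose termination criterion also tolerates outliers), the neighborhood size can be grown adaptively until every point is a k-nearest neighbor of at least one other point, i.e., every point has a reverse neighbor; no k is supplied by the user:

```python
import math

def natural_neighbor_k(points):
    """Grow k from 1 until every point appears among the k nearest
    neighbors of at least one other point (has a reverse neighbor).
    The final k serves as the adaptively found neighborhood size."""
    n = len(points)
    # precompute each point's neighbor indices ordered by distance
    order = [sorted((j for j in range(n) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
             for i in range(n)]
    for k in range(1, n):
        has_reverse = [False] * n
        for i in range(n):
            for j in order[i][:k]:
                has_reverse[j] = True  # j is a k-NN of i
        if all(has_reverse):
            return k
    return n - 1
```

Points that fail to acquire reverse neighbors even as k grows are isolated from their class structure, which is the intuition behind using natural neighbors as a parameter-free error detector.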

Proposed algorithm

A flowchart of the proposed SMOTE-NaN-DE is depicted in Fig. 3. Initially, the original data is divided into minority and majority class samples. A SMOTE-based method is used to generate synthetic samples, which are then used to enlarge the original data. Secondly, natural neighbors (NaNs) are searched. NaNs are used as the error detection method to find noisy and borderline examples, while the others are normal examples. Thirdly, differential evolution is used to optimize found noisy

Experimental settings

A server with an Intel(R) Xeon(R) Silver 4100 CPU at 2.10 GHz, 64 GB of memory and a 64-bit Windows 10 operating system is used to run all experiments. MATLAB 2015 is used for coding.

Real data sets, selected from the University of California Irvine (UCI) data sets, are used as experimental data. Table 1 describes them in terms of instance number, attributes, imbalance ratios, original class number, and abbreviations. While the proposed method could be extended to cope with multi-class

Conclusions and future plans

Noisy and borderline examples are an important issue in SMOTE and its extensions. Filtering-based methods are the main technology for solving this problem but still have technical defects: (a) error detection techniques heavily rely on parameters; (b) examples detected by error detection techniques are simply eliminated, which deviates the obtained decision boundary and makes the classes imbalanced again. To advance the state of the art, a novel filtering-based oversampling method called SMOTE-NaN-DE

CRediT authorship contribution statement

Junnan Li: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Qingsheng Zhu: Conceptualization, Methodology. Quanwang Wu: Writing - review & editing. Zhiyong Zhang: Writing - review & editing. Yanlu Gong: Writing - review & editing. Ziqing He: Writing - review & editing. Fan Zhu: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61802360 and 62006029), the Chongqing Education and Science Committee, China project (KJZH17104 and CSTC2017rgun-zdyfx0040), the Project of Chongqing Natural Science Foundation, China (cstc2019jcyj-msxmX0683), the Project was supported by the graduate scientific research and innovation foundation of Chongqing, China (Grant No. CYB20063), and the Chongqing Research and Frontier Technology, China (Grant No.

References (49)

  • Sánchez, J. et al., Analysis of new techniques to obtain quality training sets, Pattern Recognit. Lett. (2003)
  • Huang, J. et al., A non-parameter outlier detection algorithm based on natural neighbor, Knowl.-Based Syst. (2016)
  • Cheng, D. et al., Natural neighbor-based clustering algorithm with local representatives, Knowl.-Based Syst. (2017)
  • Gao, L. et al., Handling imbalanced medical image data: A deep-learning-based one-class classification approach, Artif. Intell. Med. (2020)
  • He, H. et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • Fan, W. et al., AdaCost: Misclassification cost-sensitive boosting
  • Dubey, H. et al., Class based weighted K-nearest neighbor over imbalance dataset
  • Batista, G.E. et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. (2004)
  • Li, J. et al., A parameter-free hybrid instance selection algorithm based on local sets with natural neighbors, Appl. Intell. (2002)
  • Chawla, N.V. et al., SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • Kamarulzalis, A.H. et al., Data pre-processing using SMOTE technique for gender classification with imbalance Hu's moments features
  • C. Liu, J. Wu, L. Mirador, Y. Song, W. Hou, Classifying DNA methylation imbalance data in cancer risk prediction using...
  • Nakamura, M. et al., LVQ-SMOTE: Learning vector quantization based synthetic minority over-sampling technique for biomedical data, BioData Min. (2013)
  • J. Zhang, X. Li, Phishing detection method based on Borderline-SMOTE deep belief network, in: International Conference...