SMOTE-NaN-DE: Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution
Introduction
Developing techniques for learning classifiers from class-imbalanced data is an important challenge in fields such as biomedical science [1], text classification [2], fraud prediction [3], intrusion detection [4] and image recognition [5]. In class-imbalance problems, one or more classes (the minority classes) have very few cases, while the other classes (the majority classes) have many. However, the minority classes are often the most interesting from the application point of view and tend to be misclassified.
Class-imbalance classification has been intensively researched over the last decade. The methods addressing it can be divided into three categories [6]: cost-sensitive approaches [7], algorithm-level approaches [8] and data-level approaches [9]. The data-level approach is dominant because it is independent of the classifier; it comprises oversampling [10] and undersampling [11] methods.
The Synthetic Minority Over-sampling Technique (SMOTE) [12] is one of the best-known oversampling methods. It generates new artificial minority-class examples by interpolating between minority-class examples that lie close together. SMOTE has proven successful and has an extensive range of practical applications, such as gender prediction [13], DNA recognition [14], biomedical research [15] and phishing detection [16].
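The interpolation idea can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: the function name `smote_sample` and the brute-force neighbour search are our own, and standard variants additionally stratify the number of samples per seed.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by SMOTE-style
    interpolation between a random minority seed and one of its
    k nearest minority-class neighbours (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class (brute force)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbours
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        seed = rng.integers(n)                  # pick a minority seed
        neigh = nn[seed, rng.integers(k)]       # pick one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[seed] + lam * (X_min[neigh] - X_min[seed])
    return synthetic
```

Because each synthetic point is a convex combination of two minority examples, all generated points lie inside the convex hull of the minority class, which is precisely why interpolation near noisy seeds can plant synthetic noise deep inside the majority region.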
Recent work [17], [18], [19], [20] shows that the performance degradation of SMOTE and its extensions is usually associated with noisy and borderline examples. This is because SMOTE-based methods cannot handle noise adequately and can even reinforce it: (a) the introduced instances may result from interpolation between noise and borderline examples corrupted by noise [19]; (b) the boundaries between the classes are disrupted, increasing the overlap between them [20].
This paper considers noise in the wider sense of [17], [18]. Noisy examples are examples corrupted either in their attribute values or in their class label [21]. Borderline examples are examples located very close to the decision boundary between the minority and majority classes, or in the surrounding area where the classes overlap. Attribute noise can move borderline examples to the wrong side of the decision boundary. Fig. 1 uses toy data to visualize noisy and borderline examples, where the two classes have different colors.
Change-direction and filtering-based methods have been proposed to address the problem of noisy and borderline examples. Change-direction methods guide the generation of synthetic examples performed by SMOTE towards specific parts of the input space, e.g., Borderline-SMOTE [22], Safe-Level-SMOTE [23], ADASYN [24] and Adaptive-SMOTE [25]. Filtering-based methods use error detection techniques to find noisy and corrupted borderline examples, which are then eliminated. SMOTE-TL [17], SMOTE-ENN [17] and SMOTE-IPF [20] are representative filtering-based methods; they use Tomek Links (TL) [26], the Edited Nearest Neighbor rule (ENN) [27] and the Iterative-Partitioning Filter (IPF) [28], respectively, to detect noisy and corrupted borderline examples. Although recent experimental studies, e.g., [20], confirm the superiority of filtering-based methods, they still have unnoticed but serious shortcomings:
Error detection techniques rely heavily on parameter settings, leading to unstable performance and difficulty in application. SMOTE-ENN depends on the parameter k of ENN; SMOTE-IPF depends on the parameters of IPF, e.g., the number of iterations k, the number of ensemble classifiers n and the parameters of the base classifier.
Existing filtering-based methods directly remove the examples detected by error detection techniques, which may eliminate a large number of synthetic and/or original samples. With so many examples lost, the decision boundary of the preprocessed data may deviate from the real decision boundary, and the class proportions become imbalanced again. In Fig. 2, ADASYN, a SMOTE-based method, is used to generate synthetic samples because ADASYN may generate more synthetic noise. IPF, the cleaning method of the recent SMOTE-IPF, is used to detect noisy and corrupted borderline examples in Fig. 2(c). As Fig. 2(c) shows, many synthetic and/or original examples are eliminated, so the obtained decision boundary deviates and the classes become imbalanced again.
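To make the parameter dependence concrete, here is a minimal sketch of the ENN rule used by SMOTE-ENN. The function name `enn_filter` and the toy usage are illustrative; the point is that the kept set depends directly on the choice of `k`, which is the source of instability criticized above.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbour rule: flag an example as noise when the
    majority vote of its k nearest neighbours disagrees with its label.
    Returns a boolean mask of examples to keep (illustrative sketch)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self
    keep = np.empty(len(X), dtype=bool)
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]                # k nearest neighbours of i
        labels, counts = np.unique(y[nn], return_counts=True)
        # keep i only if the neighbourhood vote agrees with its label
        keep[i] = labels[np.argmax(counts)] == y[i]
    return keep
```

Note that ENN returns only a removal decision: everything it flags is discarded, which is exactly the behavior that can thin out the minority class and shift the learned boundary.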
To advance the state of the art, a novel filtering-based oversampling method called SMOTE-NaN-DE is proposed in this paper. In SMOTE-NaN-DE, a SMOTE-based method is first used to generate synthetic samples and augment the original class-imbalanced data. Secondly, the natural neighbor [29] is introduced as an error detection technique to detect noisy and borderline examples. Thirdly, differential evolution (DE) [30] is used to iteratively optimize the position (attributes) of the detected examples instead of eliminating them. The main advantages of SMOTE-NaN-DE are that (a) it is a wrapping algorithm that can improve almost all SMOTE-based algorithms with respect to the noise problem; (b) its error detection technique is parameter-free; (c) examples found by the error detection technique are optimized by differential evolution rather than removed, which preserves the imbalance ratio and improves the boundary; (d) it is more suitable for data sets with more noise (especially class noise). The effectiveness of SMOTE-NaN-DE is validated by intensive comparative experiments on artificial and real data sets. The main contributions of this paper are highlighted as follows:
SMOTE-NaN-DE is proposed. It can address the problem of noisy and borderline examples in SMOTE-based methods. Also, it can overcome the mentioned technical defects of existing filtering-based methods.
In SMOTE-NaN-DE, a parameter-free error detection technique based on natural neighbors is proposed to detect noisy and borderline examples in class-imbalance classification. Unlike the error detection techniques [27], [28] used in existing filtering-based methods [17], [20], the proposed technique requires no parameter setting.
Existing filtering-based methods simply remove the noisy examples found by error detection techniques. In SMOTE-NaN-DE, differential evolution is modified to optimize the attributes of the examples found by the error detection technique, which preserves the imbalance ratio and improves the boundary.
SMOTE-NaN-DE is a wrapping algorithm and can improve almost all SMOTE-based methods. Hence, it is applied to four SMOTE-based oversampling methods (SMOTE [12], ADASYN [24], k-means SMOTE [31] and Adaptive-SMOTE [25]), and the resulting combinations are compared in experiments with three test classifiers.
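The differential-evolution step mentioned in the contributions can be illustrated with one generation of classic DE/rand/1/bin. The function `de_step` and the sphere fitness below are illustrative stand-ins: the paper's actual fitness is driven by its natural-neighbor error detector and is not reproduced here.

```python
import numpy as np

def de_step(pop, fitness_fn, F=0.5, CR=0.9, rng=None):
    """One generation of DE/rand/1/bin: for each target vector, build a
    mutant a + F*(b - c) from three distinct others, recombine by binomial
    crossover, and keep the trial only if its fitness improves (minimise).
    Generic sketch of the optimiser the paper builds on."""
    rng = np.random.default_rng(rng)
    n, dim = pop.shape
    fit = np.array([fitness_fn(x) for x in pop])
    new_pop = pop.copy()
    for i in range(n):
        a, b, c = rng.choice([j for j in range(n) if j != i], 3, replace=False)
        mutant = pop[a] + F * (pop[b] - pop[c])    # differential mutation
        cross = rng.random(dim) < CR               # binomial crossover mask
        cross[rng.integers(dim)] = True            # guarantee one mutated gene
        trial = np.where(cross, mutant, pop[i])
        if fitness_fn(trial) < fit[i]:             # greedy one-to-one selection
            new_pop[i] = trial
    return new_pop
```

Because selection is greedy per individual, each example's fitness can only improve or stay put, which is why repositioning flagged examples this way cannot make the cleaned data worse under the chosen fitness, unlike outright deletion.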
The rest of the paper is structured as follows. Related work is introduced in Section 2, and related terms and concepts are described in Section 3. Section 4 introduces SMOTE-NaN-DE, and intensive experiments are carried out in Section 5. Finally, Section 6 concludes the paper and outlines future work.
Section snippets
Related work
SMOTE was first proposed by Chawla et al. [12] to handle class-imbalance classification. Because it can wrap around other methods, SMOTE has been used to help other classifiers adapt to class-imbalanced data. To adapt boosting to imbalanced classification, SMOTEBoost [32] applies SMOTE in each boosting iteration. In KSMOTE [33], SMOTE is used to help a support vector machine adapt to class-imbalanced data. Besides, SMOTECSELM [34] combines ELM with SMOTE…
Preliminaries
The main symbols and terms are introduced first. Then the theory of natural neighbors is briefly stated, since the proposed error detection method is based on it.
Proposed algorithm
A flowchart of the proposed SMOTE-NaN-DE is depicted in Fig. 3. Initially, the original data is divided into minority- and majority-class samples, a SMOTE-based method is used to generate synthetic samples, and the synthetic samples are used to enlarge the original data. Secondly, natural neighbors (NaNs) are searched; NaNs serve as the error detection method that separates noisy and borderline examples from normal examples. Thirdly, differential evolution is used to optimize the detected noisy…
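As a rough illustration of the parameter-free search step, the sketch below grows the neighborhood size until no point is left unchosen (or the count of such points stops shrinking), then records mutual neighbors. The function `natural_neighbors` is a simplified reading of the algorithm of [29], not the paper's exact procedure; points with few or no mutual neighbors are the natural candidates for the noisy/borderline set.

```python
import numpy as np

def natural_neighbors(X):
    """Parameter-free natural-neighbour search (simplified sketch after
    [29]): grow the neighbourhood size r until every point has been chosen
    as someone's nearest neighbour, or the number of unchosen ('orphan')
    points stops shrinking. Returns the final radius and, for each point,
    the set of mutual (natural) neighbours within that radius."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self
    order = np.argsort(d, axis=1)                # neighbours sorted by distance
    in_count = np.zeros(n, dtype=int)            # how often each point is chosen
    prev_orphans, r = -1, 0
    while r < n - 1:
        for i in range(n):
            in_count[order[i, r]] += 1           # add each point's (r+1)-th NN
        orphans = int(np.sum(in_count == 0))     # points nobody has chosen yet
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
        r += 1
    # mutual-neighbour sets within the final radius r + 1
    nan_sets = [set() for _ in range(n)]
    for i in range(n):
        for j in order[i, :r + 1]:
            if i in order[j, :r + 1]:            # mutual within the radius
                nan_sets[i].add(int(j))
    return r + 1, nan_sets
```

The key property is that the radius is determined by the data itself rather than supplied by the user, which is what the paper means by a parameter-free error detection technique.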
Experimental settings
A server with an Intel(R) Xeon(R) Silver 4100 CPU at 2.10 GHz, 64 GB of memory and a 64-bit Windows 10 operating system is used to run all experiments. MATLAB 2015 is used for coding.
Real data sets, selected from the University of California Irvine (UCI) repository, are used as experimental data. Table 1 describes them in terms of instance number, attributes, imbalance ratios, original class number, and abbreviations. While the proposed method could be extended to cope with multi-class…
Conclusions and future plans
Noisy and borderline examples are an important issue in SMOTE and its extensions. Filtering-based methods are the main technology for solving this problem but still have technical defects: (a) error detection techniques rely heavily on parameters; (b) examples detected by error detection techniques are simply eliminated, leading to deviation of the obtained decision boundary and renewed class imbalance. To advance the state of the art, a novel filtering-based oversampling method called SMOTE-NaN-DE…
CRediT authorship contribution statement
Junnan Li: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Qingsheng Zhu: Conceptualization, Methodology. Quanwang Wu: Writing - review & editing. Zhiyong Zhang: Writing - review & editing. Yanlu Gong: Writing - review & editing. Ziqing He: Writing - review & editing. Fan Zhu: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61802360 and 62006029), the Chongqing Education and Science Committee, China (KJZH17104 and CSTC2017rgun-zdyfx0040), the Chongqing Natural Science Foundation, China (cstc2019jcyj-msxmX0683), the Graduate Scientific Research and Innovation Foundation of Chongqing, China (Grant No. CYB20063), and the Chongqing Research and Frontier Technology, China (Grant No.
References (49)
- et al., A contemporary feature selection and classification framework for imbalanced biomedical datasets, Egypt. Inform. J. (2018)
- et al., Imbalanced text sentiment classification using universal and domain-specific knowledge, Knowl.-Based Syst. (2018)
- et al., Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Inf. Sci. (2020)
- et al., Dual-stage intrusion detection for class imbalance scenarios, Comput. Fraud Secur. (2019)
- et al., A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance, Inf. Sci. (2019)
- et al., Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection, Appl. Soft Comput. (2014)
- et al., Learning imbalanced datasets based on SMOTE and Gaussian distribution, Inf. Sci. (2020)
- et al., Natural neighbor: a self-adaptive neighborhood method without parameter k, Pattern Recognit. Lett. (2016)
- et al., A hybrid classifier combining borderline-SMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: a case study in Taiwan, Comput. Methods Programs Biomed. (2015)
- et al., A concurrency control algorithm for nearest neighbor query, Inf. Sci. (1999)