LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM

https://doi.org/10.1016/j.knosys.2020.105845

Abstract

Machine learning classification algorithms are now widely used. One of the main problems they face is unbalanced data sets: most classification algorithms are not sensitive to class imbalance, and therefore classify unbalanced data sets poorly. The problem of unbalanced data categories also arises in loose particle detection for sealed electronic components, where the signals generated by internal components always outnumber the signals generated by loose particles, which easily leads to misjudgment in classification. To classify unbalanced data sets more accurately, this paper builds on the traditional SMOTE oversampling algorithm and proposes the LR-SMOTE algorithm, which keeps newly generated samples close to the sample center and avoids generating outlier samples or changing the distribution of the data set. Experiments were carried out on four UCI public data sets and six self-built data sets, comparing unmodified data sets with data sets balanced by the LR-SMOTE and SMOTE algorithms, using the random forest and support vector machine algorithms as classifiers. The experimental results show that LR-SMOTE performs better than SMOTE in terms of G-means, F-measure, and AUC.

Introduction

The problem of unbalanced data classification exists in many real-life fields and has received widespread attention. For example, data imbalance arises in areas such as anomaly detection [1], fault diagnosis [2], and face recognition [3]. Data imbalance [4] means that, in binary classification, the number of samples of one class is much greater than that of the other. Usually, the class with most of the samples is called the majority class, and the other is called the minority class. Studies have shown that when one class greatly outnumbers the others, traditional classification algorithms pay more attention to the classification accuracy of the majority class samples. (Some studies consider a data set imbalanced when the majority class has three times or more the number of samples in the minority class.) However, the minority class samples in an unbalanced data set are often the focus of research, so it is necessary to improve the recognition accuracy of the minority class samples [5].

The sealed relay is a basic component in the aerospace industry, and the presence of loose particles inside it directly affects the reliability of the entire system. Loose particles are impurities, such as metal chips and solder slag, inadvertently sealed inside during the production, manufacture, packaging, and use of sealed relays [6], [7]. The internal component signal is the signal generated by the movable components inside the relay during vibration detection. In the detected data set, internal component signals far outnumber loose particle signals; that is, the data set is unbalanced. How to identify loose particle signals among a large number of internal component signals is therefore particularly important.

Class imbalance can impair the predictive ability of a classification algorithm because the algorithm pursues overall classification accuracy [8]. To address the difficulty of classifying unbalanced data sets, researchers work mainly at the data level and the algorithm level [8], [9]. The main improvement at the data level is to balance the number of samples across classes, either by adding samples to the minority class (oversampling) or by deleting samples from the majority class (undersampling) [10], [11]. At the algorithm level, the original algorithm is mainly improved by introducing cost-sensitive learning and ensemble methods [12], [13], [14].

Data-level improvement is not limited to a specific domain or classifier model [15], [16], [17], and it is therefore more generally applicable than tailoring an algorithm to a particular classifier [10]. Undersampling reduces the imbalance of a data set by reducing the majority class samples; when the reduction is done randomly, it is called random undersampling [18], [19], [20], [21]. Although this method can equalize the class sizes and reduces the total sample size, which lowers computing time, it may delete important information from the data set [12]. Oversampling methods, on the other hand, add minority class samples to an imbalanced data set. The simplest approach is to directly copy minority class samples, that is, to generate identical samples. Although oversampling does not lose data information, such processing often produces over-fitting [21], [22], [23], [24], [25].
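As a minimal illustration of the two data-level strategies (our own sketch, not code from the paper; the function name and interface are assumptions), both can be written in a few lines:

```python
import numpy as np

def random_resample(X_min, X_maj, mode="over", seed=0):
    """Balance two classes by random over- or under-sampling.
    'over' duplicates minority rows until the classes match
    (risk: over-fitting); 'under' keeps a random majority subset
    (risk: deleting important information)."""
    rng = np.random.default_rng(seed)
    if mode == "over":
        idx = rng.integers(0, len(X_min), size=len(X_maj))
        return X_min[idx], X_maj          # minority sampled with replacement
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    return X_min, X_maj[idx]              # majority sampled without replacement
```

Either call returns two equally sized class arrays; note that oversampling only ever repeats existing minority rows, which is exactly the duplication problem that synthetic methods such as SMOTE were designed to avoid.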

Many oversampling techniques have proven effective in real-world applications. The SMOTE algorithm is the most basic synthetic oversampling method, and some of the most popular approaches to imbalanced learning problems are based on synthetic oversampling.

The algorithm proposed in this paper improves SMOTE at the data level. We propose a data oversampling algorithm named LR-SMOTE, whose main purpose is to generate new samples more reasonably. The biggest shortcoming of SMOTE is that the location of a new sample is not constrained: it is placed at random between two sample points. If there are noise samples or outliers, the newly generated samples may themselves be noise points or outliers. LR-SMOTE addresses this shortcoming, and compared with SMOTE, its complexity is not high.
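The shortcoming described above is visible in the classic SMOTE generation rule itself, which can be sketched as follows (an illustrative sketch of the rule from Chawla et al. [2002], not the authors' implementation):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng=None):
    """Classic SMOTE generation rule: place a synthetic sample at
    a uniformly random point on the segment between a minority
    sample x and one of its k nearest minority neighbors. Nothing
    constrains the point to lie near the class center, so a noisy
    neighbor produces a noisy synthetic sample."""
    rng = np.random.default_rng(rng)
    gap = rng.random()                 # uniform in [0, 1)
    return x + gap * (neighbor - x)
```

LR-SMOTE keeps this interpolation idea but restricts where the synthetic point may fall, so that outlier neighbors do not propagate into the generated data.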

The main contribution of this article is a new oversampling algorithm, LR-SMOTE, for the class imbalance problem in binary classification. It improves the traditional oversampling algorithm so that new samples are distributed around the minority class sample center, avoiding the generation of noise.

The remainder of this paper is organized as follows. Section 2 provides a brief review of existing work on the imbalanced problem domain. Section 3 describes how the proposed LR-SMOTE algorithm works. Experimental details and results are presented in Section 4, and the experimental simulation results and discussion in Section 5. Finally, Section 6 concludes the paper with some future research directions.


Related work

Unbalanced data set classification problems have been studied for more than 20 years, yet learning from unbalanced data is still a challenge for machine learning [26], [27], [28], [29]. This is mainly because new applications in the real world are constantly appearing, and data imbalance is embedded in these applications [30]. Data imbalance can be divided into relative imbalance and absolute imbalance. Relative imbalance means that the number of minority samples is not too small, but it is

Proposed method

Since the SMOTE algorithm may generate outlier sample points, and the existence of noise in the data reduces the quality of newly generated samples, this paper proposes the LR-SMOTE algorithm based on the traditional SMOTE algorithm. The proposed method first uses SVM and k-means to remove noise from the original data set, then changes the formula used to generate new samples. The rule of generating a new sample between a sample and its neighbor in the SMOTE algorithm is expanded
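The two steps above can be sketched as follows. This is our illustrative reconstruction, not the authors' code: the exact noise-filtering rule and generation formula of LR-SMOTE differ in detail, and here we simply assume that minority samples misclassified by an SVM are dropped as noise and that synthetics are pulled toward a single k-means center of the cleaned minority class.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def lr_smote_sketch(X, y, n_new, minority=1, seed=0):
    # 1) Noise filtering (assumption: drop minority points that an
    #    SVM trained on the full data set misclassifies).
    svm = SVC(kernel="rbf").fit(X, y)
    X_min = X[(y == minority) & (svm.predict(X) == minority)]
    # 2) Center of the cleaned minority class via k-means.
    km = KMeans(n_clusters=1, n_init=10, random_state=seed).fit(X_min)
    center = km.cluster_centers_[0]
    # 3) Generate synthetics on the segment between a minority
    #    sample and the center, keeping them inside the class region.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)
    gaps = rng.random((n_new, 1))
    return X_min[idx] + gaps * (center - X_min[idx])
```

Because every synthetic point lies between a cleaned minority sample and the class center, no generated sample can fall farther from the center than a real one, which is the behavior the paper's modified formula aims at.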

Data set

The experiment uses six detection-result data sets from actual engineering and four unbalanced data sets from the UCI repository to evaluate the performance of the improved algorithm on unbalanced data. The experimental data are described in detail in Table 1, which includes the number of samples, the number of features, the sizes of the minority and majority classes, and the imbalance rate between them. The size of the data set ranges from

Discussion

Based on SMOTE, we propose a new oversampling method, the LR-SMOTE algorithm, for dealing with class imbalance problems, and achieved satisfactory classification accuracy after LR-SMOTE processing. In this study, we first used a combination of the support vector machine algorithm and k-means to de-noise the samples before oversampling, improving their quality and making the generated samples more meaningful. Secondly, we improved the formula used to generate new samples. Before the modification, the
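The evaluation metrics reported in the paper, G-means and F-measure (AUC is computed from the classifier's scores), follow from the binary confusion matrix; a standard sketch, not tied to the paper's code:

```python
import numpy as np

def gmean_fmeasure(y_true, y_pred, positive=1):
    """G-mean and F-measure for binary classification, treating the
    minority class as the positive class. G-mean balances accuracy
    on both classes; F-measure balances precision and recall on the
    minority class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    recall = tp / (tp + fn)          # minority-class accuracy
    specificity = tn / (tn + fp)     # majority-class accuracy
    precision = tp / (tp + fp)
    g_mean = np.sqrt(recall * specificity)
    f_measure = 2 * precision * recall / (precision + recall)
    return g_mean, f_measure
```

Unlike overall accuracy, both metrics stay low whenever the minority class is classified poorly, which is why they are preferred for unbalanced data sets.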

Conclusions

Data imbalance presents difficulties for many classification algorithms. Oversampling the training data to make it more evenly distributed is an effective way to address this problem at the data-processing level. On the one hand, random oversampling leads to overfitting, which reduces the classification performance of the model on unseen data. On the other hand, if the generated data is not controlled, samples with noise are often generated, which blur the sample boundary and hinder the

CRediT authorship contribution statement

X.W. Liang: Conceptualization, Methodology, Software, Investigation, Writing - original draft. A.P. Jiang: Validation, Formal analysis, Visualization, Software. T. Li: Validation, Formal analysis, Visualization. Y.Y. Xue: Resources, Writing - review & editing, Supervision, Data curation. G.T. Wang: Resources, Writing - review & editing, Supervision, Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was co-supported by the National Natural Science Foundation of China (Nos. 51607059, 51077022 and 61271347); the Natural Science Foundation of Heilongjiang Province (QC2017059); the Postdoctoral Fund in Heilongjiang Province (LBH-Z16169); the Talent Innovation Special Project of Heilongjiang Province (HDRCCX-201604); the Science and Technology Innovative Research Team in Higher Educational Institutions of Heilongjiang Province (No. 2012TD007); and the Heilongjiang University Youth Science Fund Project (

References (42)

  • Yang, Z., et al., Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers, IEEE Trans. Syst. Man Cybern. C (2009).
  • Liu, Y.-H., et al., Face recognition using total margin-based adaptive fuzzy support vector machines, IEEE Trans. Neural Netw. (2007).
  • Ganganwar, V., An overview of classification algorithms for imbalanced datasets, Int. J. Emerg. Technol. Adv. Eng. (2012).
  • Chawla, N.V., et al., SMOTE: Synthetic minority over-sampling technique, J. Artificial Intelligence Res. (2002).
  • Tao, X., Discussion on control method of unloaded objects in spacecraft assembly, Aerosp. Environ. Eng. (2006).
  • Japkowicz, N., Learning from imbalanced data sets: A comparison of various strategies, in: AAAI Workshop Learn. ...
  • Li, H., et al., The clustering-based case-based reasoning for imbalanced business failure prediction: A hybrid approach through integrating unsupervised process with supervised process, Internat. J. Systems Sci. (2014).
  • Kotsiantis, S., Kanellopoulos, D., Pintelas, P., Handling imbalanced datasets: A review, 30 (1) (2006) 25–36, ...
  • Galar, M., et al., A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. C (2012).
  • Barua, S., et al., MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. (2014).