LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM
Introduction
The problem of classifying unbalanced data exists in many real-life fields and has received widespread attention, for example in anomaly detection [1], fault diagnosis [2], and face recognition [3]. Data imbalance [4] means that, in a binary classification task, the number of samples of one class is much greater than that of the other. The class holding most of the samples is usually called the majority class, and the other the minority class. Studies have shown that when one class greatly outnumbers the other, traditional classification algorithms pay more attention to the classification accuracy of the majority class samples. (Some studies suggest that a data set is imbalanced when the majority class has at least three times as many samples as the minority class.) However, the minority class samples in an unbalanced data set are often the focus of research, so it is necessary to improve their recognition accuracy [5].
Sealed relays are basic components in the aerospace industry, and the presence of loose particles inside them directly affects the reliability of the entire system. Loose particles are impurities, such as metal chips, solder slag, and other debris, inadvertently sealed in during the production, manufacture, packaging, and use of sealed relays [6], [7]. The internal component signal is the signal generated by the movable components inside the relay during vibration detection. In the detected data set, internal component signals far outnumber loose particle signals; that is, the data set is unbalanced. Identifying loose particle signals among a large number of internal component signals is therefore particularly important.
Class imbalance can impair the predictive ability of a classification algorithm because the algorithm pursues overall classification accuracy [8]. To address the difficulty of classifying unbalanced data sets, researchers mainly work at the data level and the algorithm level [8], [9]. Data-level methods balance the number of samples across classes by adding samples to the minority class (oversampling) or deleting samples from the majority class (undersampling) [10], [11]. Algorithm-level methods adapt the original algorithm by introducing cost-sensitive learning and ensemble methods [12], [13], [14].
Data-level improvement does not depend on a specific domain or classifier model [15], [16], [17], which makes it more generally applicable than tailoring an algorithm to a particular classifier [10]. Undersampling reduces the imbalance of the data set by removing majority class samples; when the removal is random, the method is called random undersampling [18], [19], [20], [21]. Although this equalizes the class sizes and reduces the total sample size, which can reduce computing time, it may delete important information from the data set [12]. Oversampling methods, on the other hand, add minority class samples to an imbalanced data set. The simplest approach is to directly copy minority class samples, that is, to generate identical samples. Although oversampling does not lose data information, such duplication often causes over-fitting [21], [22], [23], [24], [25].
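The two data-level strategies above can be sketched in a few lines of NumPy. This is a minimal illustration of random oversampling and random undersampling (function names are ours, not from the paper):

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until the classes balance."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # draw (with replacement) enough extra minority indices to match the majority
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label, seed=0):
    """Randomly discard majority samples until the classes balance."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # keep only as many majority samples as there are minority samples
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]
```

Note that oversampling here produces exact copies, which is precisely the duplication that can cause over-fitting, while undersampling irreversibly discards majority samples.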
Many oversampling techniques have proven effective in real-world applications. The SMOTE algorithm is the most basic oversampling method, and some of the most popular approaches to imbalanced learning are built on synthetic oversampling.
The algorithm proposed in this paper improves the SMOTE algorithm at the data level. We propose a data oversampling algorithm named LR-SMOTE, whose main purpose is to generate new samples more reasonably. The biggest shortcoming of SMOTE is that the position of a new sample is not constrained: it is generated at a random point between two existing sample points, so if noise or outlier samples are present, the newly generated samples may themselves be noise points or outliers. LR-SMOTE addresses this disadvantage, and compared with SMOTE its computational complexity remains low.
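The random-interpolation behavior criticized above is the core of classic SMOTE (Chawla et al.). A minimal NumPy sketch of that standard rule makes the shortcoming visible: the synthetic point lands anywhere on the segment between a minority sample and a neighbor, with no check that either endpoint is clean (function and variable names are ours):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Classic SMOTE: each synthetic point lies on the line segment between
    a minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)                          # cannot have more neighbours than points
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                    # pick a base minority sample
        nb = X_min[rng.choice(neighbours[j])]  # pick one of its k neighbours
        synthetic[i] = X_min[j] + rng.random() * (nb - X_min[j])
    return synthetic
```

If the base sample or its neighbor is a noise point, the interpolated sample inherits that noise, which is the failure mode LR-SMOTE is designed to avoid.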
The main contribution of this article is as follows: to solve the class imbalance problem in binary classification, we propose a new oversampling algorithm, LR-SMOTE, which improves the traditional oversampling algorithm so that new samples are distributed near the minority class centers and noise generation is avoided.
The remainder of this paper is divided into six sections. In Section 2, we provide a brief review of existing works on the imbalanced problem domain. In Section 3 we describe how the proposed LR-SMOTE algorithm works. Experimental details and experimental results are presented in Section 4. The experimental simulation results and discussion are presented in Section 5. Finally, in Section 6, we conclude the paper with some future research directions.
Related work
Unbalanced data set classification problems have been studied for more than 20 years, yet learning from unbalanced data is still a challenge for machine learning [26], [27], [28], [29]. This is mainly because new applications in the real world are constantly appearing, and data imbalance is embedded in these applications [30]. Data imbalance can be divided into relative imbalance and absolute imbalance. Relative imbalance means that the number of minority samples is not too small, but it is
Proposed method
Since the SMOTE algorithm may generate outlier sample points, noise in the data reduces the quality of the newly generated samples. This paper proposes the LR-SMOTE algorithm based on the traditional SMOTE algorithm. The proposed method first uses SVM and k-means to remove noise from the original data set, then modifies the formula used to generate new samples. The SMOTE rule of generating a new sample between a sample and its neighbor sample is expanded
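The paper's exact filtering rule and generation formula are not reproduced in this excerpt, so the following is only an illustrative sketch, under our own assumptions, of how an SVM and k-means can be combined to flag noisy samples before oversampling: an SVM flags samples it misclassifies, k-means flags minority samples lying unusually far from their cluster center, and samples flagged by both tests are dropped. All thresholds and names here are ours, not the authors':

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def filter_noise(X, y, minority_label=1, n_clusters=3, seed=0):
    """Illustrative SVM + k-means noise filter (not the paper's exact rule)."""
    # 1) an SVM trained on the full set flags the samples it misclassifies
    svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
    misclassified = svm.predict(X) != y

    # 2) k-means on the minority class flags samples far from their centre
    minority = y == minority_label
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X[minority])
    dist = np.linalg.norm(X[minority] - km.cluster_centers_[km.labels_], axis=1)
    far = np.zeros(len(X), dtype=bool)
    far[np.flatnonzero(minority)] = dist > dist.mean() + 2 * dist.std()

    # 3) remove only samples flagged by both tests
    keep = ~(misclassified & far)
    return X[keep], y[keep]
```

Requiring both flags is a conservative design choice for this sketch: it avoids discarding hard-but-valid boundary samples that only one of the two tests would reject.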
Data set
The experiment uses six detection-result data sets from actual engineering and four unbalanced data sets from the UCI repository to evaluate the performance of the improved algorithm on unbalanced data. The experimental data are described in detail in Table 1, which lists the number of samples, the number of features, the numbers of minority and majority class samples, and the imbalance ratio between the majority and minority classes. The size of the data set ranges from
Discussion
Based on SMOTE, we propose a new oversampling method, the LR-SMOTE algorithm, for dealing with class imbalance, and achieved satisfactory classification accuracy after LR-SMOTE processing. In this study, we first used a combination of the support vector machine algorithm and k-means to de-noise the samples before oversampling, improving sample quality and making the generated samples more meaningful. Secondly, we improved the formula used to generate new samples. Before the modification, the
Conclusions
Data imbalance presents difficulties for many classification algorithms. Oversampling the training data to make it more evenly distributed is an effective way to solve this problem at the data-processing level. On the one hand, random oversampling leads to overfitting, which reduces the classification performance of the model on unseen data. On the other hand, if the generated data is not controlled, noisy samples are often generated, which blur the sample boundary and hinder the
CRediT authorship contribution statement
X.W. Liang: Conceptualization, Methodology, Software, Investigation, Writing - original draft. A.P. Jiang: Validation, Formal analysis, Visualization, Software. T. Li: Validation, Formal analysis, Visualization. Y.Y. Xue: Resources, Writing - review & editing, Supervision, Data curation. G.T. Wang: Resources, Writing - review & editing, Supervision, Data curation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was co-supported by the National Natural Science Foundation of China (Nos. 51607059, 51077022 and 61271347); Natural Science Foundation of Heilongjiang Province (QC2017059); Postdoctoral Fund in Heilongjiang Province (LBH-Z16169); Talent Innovation Special Project of Heilongjiang Province (HDRCCX-201604); Science and Technology Innovative Research Team in Higher Educational Institutions of Heilongjiang Province (No. 2012TD007); Heilongjiang University Youth Science Fund Project (
References (42)
- et al., Iterative Boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs, Pattern Recognit. (2010)
- et al., Cluster-based weighted oversampling for ordinal regression (CWOS-Ord), Neurocomputing (2016)
- et al., Diversity techniques improve the performance of the best imbalance learning ensembles, Inform. Sci. (2015)
- et al., Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis, Comput. Ind. Eng. (2020)
- et al., A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data, Appl. Soft Comput. (2018)
- et al., An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inform. Sci. (2013)
- et al., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. (2009)
- et al., ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing (2013)
- et al., SVM-TIA: a shilling attack detection method based on SVM and target item analysis in recommender systems, Neurocomputing (2016)
- et al., A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients, J. Biomed. Inform. (2015)
- Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers, IEEE Trans. Syst. Man Cybern. C
- Face recognition using total margin-based adaptive fuzzy support vector machines, IEEE Trans. Neural Netw.
- An overview of classification algorithms for imbalanced datasets, Int. J. Emerg. Technol. Adv. Eng.
- SMOTE: synthetic minority over-sampling technique, J. Artificial Intelligence Res.
- Discussion on control method of unloaded objects in spacecraft assembly, Aerosp. Environ. Eng.
- The clustering-based case-based reasoning for imbalanced business failure prediction: a hybrid approach through integrating unsupervised process with supervised process, Internat. J. Systems Sci.
- A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. C
- MWMOTE — majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng.