A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems
Introduction
A classification problem is referred to as imbalanced when the instances in one or several classes, known as the majority classes, outnumber the instances of the other classes, called the minority classes. Such an imbalance in the data represents the so-called between-class imbalance [1], in contrast to the related issue of within-class imbalance [2], [3]. Imbalanced problems widely exist in medical diagnosis, science and engineering; examples include surveillance of nosocomial infection [4], cardiac care [5] and elucidating protein–protein interactions [6], as well as fraud detection [7], [8], network intrusion detection [9] and telecommunication management [10]. Note that, in an imbalanced problem, the minority classes are usually the more important ones. For instance, in the surveillance of nosocomial infection, the 11% of patients who suffer from one or more infections [4] constitute the minority class of primary interest. In the study of two-class imbalanced problems, the instances in the majority class are referred to as negative, while the instances in the minority class are referred to as positive. Since in practice the minority class is more important, one should be more concerned with the positive instances. Imbalanced data learning has been widely researched [11], [12], [13], [14], [15], [16]. Typically, the approaches for solving the imbalanced problem can be divided into two categories: re-sampling methods and imbalanced learning algorithms.
The re-sampling approach is essentially a re-balancing process applied to the given imbalanced data set. The studies [17], [18] on class distribution have shown that balanced data sets provide better classification performance than imbalanced ones, though some other studies [1], [19] have argued that imbalanced data sets are not necessarily responsible for the poor performance of some classifiers. Re-sampling techniques are attractive under most imbalanced circumstances, because re-sampling adjusts only the original training data set, instead of modifying the learning algorithm. Thus, this approach is external and transportable [18], [20], and it provides a convenient and effective way to deal with imbalanced learning problems using standard classifiers. Specifically, the re-sampling methods include random over-sampling, which randomly appends replicated instances to the positive class, and random under-sampling, which randomly removes instances from the majority class. Alternatively, there exist guided over-sampling and under-sampling, in which the choice of instances to replicate or eliminate is informed rather than random. In addition, the synthetic minority over-sampling technique (SMOTE) [21] is a widely acknowledged over-sampling method. In the SMOTE, instead of merely duplicating existing data, the positive class is over-sampled by creating synthetic instances along the line segments joining positive instances and their K nearest neighbours.
The second category, consisting of imbalanced learning algorithms, can be regarded as a process to modify or re-balance existing learning algorithms so that they can deal with imbalanced problems effectively. The imbalanced learning algorithms include the cost-sensitive methods [22], [23], [24], [25] and the discrimination-based and recognition-based approaches [3]. An alternative is to adapt standard kernel-based or radial basis function (RBF) classifiers, which use a fixed common variance for every RBF kernel and choose RBF centres from the input data, to imbalanced data sets by modifying the kernel construction and model selection procedure. A representative work [26] of this imbalanced learning proposes a regularised weighted least square estimator (LSE) using the orthogonal forward selection (OFS) based on the model selection criterion of maximising the leave-one-out (LOO) area under the curve (AUC) of the receiver operating characteristics (ROC). In this LOO-AUC+OFS algorithm [26], the cost function of the LSE is made sensitive to the class labels, such that the errors due to minority-class data samples are given a higher weight; when all samples are weighted equally, this weighted LSE (WLSE) reduces to the standard LSE. A well-known RBF modelling approach is the two-stage procedure [27], in which the RBF centres are first determined by clustering [28] and the RBF weights are then obtained using the LSE. To cope with imbalanced data sets, a natural extension of [27] is to modify the latter stage into the WLSE, using the same weighted cost function as [26]. This clustering+WLSE algorithm provides a viable alternative within this imbalanced learning category.
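To make the weighted cost function concrete, the following minimal Python sketch shows a class-weighted least-squares estimate of the RBF weights. This is an illustration rather than the exact formulation of [26]; the function name, the design matrix of kernel responses and the single scalar weighting factor are all hypothetical. With the minority weight set to one it reduces to the standard LSE.

```python
import numpy as np

def weighted_lse(G, y, minority_weight=2.0):
    """Class-weighted least-squares estimate of RBF weights (illustrative).

    G : (N, M) design matrix of RBF kernel responses on the training set.
    y : (N,) class labels in {-1, +1}, with +1 denoting the minority class.
    Errors on minority-class samples are weighted by `minority_weight`;
    with minority_weight = 1.0 this reduces to the standard LSE.
    """
    b = np.where(y > 0, minority_weight, 1.0)  # per-sample error weights
    B = np.diag(b)
    # Solve the weighted normal equations (G^T B G) w = G^T B y
    return np.linalg.solve(G.T @ B @ G, G.T @ B @ y)
```

Increasing the minority weight pulls the fitted decision function towards correctly reproducing the positive instances, at the cost of larger errors on the majority class.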
Kernel-based learning, such as the support vector machine (SVM) and the RBF network, is widely used for solving balanced learning problems. In particular, a powerful approach for constructing the RBF and other sparse kernel classifiers is to assign a fixed common variance to every kernel and to select input data as the candidate centres for the RBF kernels by minimising the leave-one-out (LOO) misclassification rate in the efficient OFS procedure [29]. This approach has its roots in regression applications [30], [31], [32], [33]. Two limitations may be associated with this “fixed” RBF kernel approach. Firstly, RBF kernels cannot be flexibly tuned, as the position of each kernel is restricted to the input data and the shape of each kernel is fixed rather than determined by the learning procedure. Secondly, the common kernel variance has to be determined via cross validation, which inevitably increases the computational cost. Previous studies [34], [35], [36] proposed constructing a tunable RBF classifier based on the OFS procedure, using a global search optimisation algorithm [37] to optimise the RBF kernels one by one. This tunable RBF kernel approach is observed to produce sparser classifiers with better performance, but at a higher computational complexity in classifier construction, in comparison with the standard fixed kernel approach. Recently, the particle swarm optimisation (PSO) algorithm [38] has been adopted to minimise the LOO misclassification rate in the OFS construction of the tunable RBF classifier [39], [40]. PSO [38] is an efficient population-based stochastic optimisation technique inspired by the social behaviour of bird flocks and fish schools, and it has been successfully applied to wide-ranging optimisation applications [41], [42], [43], [44], [45], [46].
Owing to the efficiency of PSO, the tunable RBF modelling approach advocated in [39], [40] offers significant advantages in terms of better generalisation performance and smaller classifier size, as well as lower complexity in the learning process, compared with the standard fixed kernel approach. This PSO aided tunable RBF classifier represents the state of the art for balanced data sets.
Although the study [1] has shown that kernel-based methods are relatively robust to imbalanced problems, the detrimental effects of a highly imbalanced data set can seriously degrade the generalisation performance of kernel-based classifiers. In order to achieve better classification performance for highly imbalanced data, an effective approach is to integrate kernel-based classifiers with re-sampling methods. Previous studies [47], [48], [49] mainly focused on SVMs. Specifically, the method of [47] combined the SMOTE with different costs to bias the SVM, assigning different classes different misclassification costs so as to shift the decision boundary away from the positive instances and to define a better boundary. The work [48] proposed ensemble systems that re-sample data sets to form the input to the standard SVM classifier, while the method of [49] introduced asymmetric misclassification costs in SVMs so as to improve classification performance. Another integration of the SVM with under-sampling combined the granular support vector machine (GSVM) [50] with repetitive under-sampling (RU) to form the GSVM–RU algorithm [51].
Against this background, this contribution proposes an effective alternative to deal with two-class imbalanced classification problems by combining the SMOTE algorithm [21] and the PSO aided RBF classifier [39], [40]. Specifically, the SMOTE is first applied to generate synthetic instances in the positive class to balance the training data set. Using the resulting balanced data set, the tunable RBF classifier is then constructed by applying the PSO to minimise the LOO misclassification rate in the computationally efficient OFS procedure. In the experimental study involving a simulated imbalanced data set and three real imbalanced data sets, three benchmarks are used to compare with the proposed SMOTE+PSO-OFS method. The first benchmark combines the SMOTE [21] and the K-nearest neighbour (kNN) classifier [52], which will be denoted as the SMOTE+kNN. The kNN classifier is a widely used classification method, and this combination of the SMOTE and the kNN represents a typical method of the re-sampling approach for imbalanced problems. The second benchmark is the algorithm advocated in [26], denoted by the LOO-AUC+OFS, which is a state-of-the-art representative of the second approach for dealing with imbalanced problems. The third benchmark, the clustering+WLSE algorithm, as discussed previously, is also a typical method of the imbalanced learning approach. The experimental results obtained demonstrate that the proposed method is competitive with these existing state-of-the-art methods for two-class imbalanced problems.
The rest of the paper is organised as follows. Section 2 introduces the tunable RBF model for two-class classification and the OFS procedure based on the LOO misclassification rate, while Section 3 presents the PSO algorithm for tuning the RBF kernels by minimising the LOO misclassification rate. Section 4 introduces the SMOTE method and presents the proposed combined SMOTE and PSO based RBF algorithm. The effectiveness of our approach is demonstrated by numerical examples in Section 5, and our conclusions are given in Section 6.
Section snippets
RBF classifier for two-class problems
Consider the two-class data set D_N = {(x_k, y_k)}, 1 ≤ k ≤ N, that contains N data instances, where y_k ∈ {−1, +1} denotes the class label for the feature vector x_k ∈ R^m, while there are N+ positive instances and N− negative instances, with N = N+ + N−. We use the data set D_N to construct the RBF classifier of the form ŷ_k = sgn(f_M(x_k)), where f_M(x_k) = Σ_{n=1}^{M} w_n g_n(x_k), M is the number of RBF kernels, f_M(x_k) is the output of the M-term classifier with the M kernels g_n(·), for 1 ≤ n ≤ M, and w = [w_1 w_2 ⋯ w_M]^T is the weight vector
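As a concrete illustration of this decision function, the following Python sketch evaluates the M-term RBF output and its sign decision. The Gaussian kernel form and the function and parameter names are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def rbf_output(x, centres, sigmas, weights):
    """M-term RBF classifier output f_M(x) = sum_n w_n * g_n(x),
    assuming Gaussian kernels g_n(x) = exp(-||x - mu_n||^2 / (2 sigma_n^2)).

    centres : (M, m) kernel centre vectors mu_n.
    sigmas  : (M,) kernel widths sigma_n.
    weights : (M,) weight vector w.
    """
    d2 = np.sum((centres - x) ** 2, axis=1)   # squared distances to the M centres
    g = np.exp(-d2 / (2.0 * sigmas ** 2))     # kernel responses g_n(x)
    return g @ weights

def classify(x, centres, sigmas, weights):
    """Predicted class label: the sign of the classifier output."""
    return 1 if rbf_output(x, centres, sigmas, weights) >= 0 else -1
```

For example, with one positively weighted kernel centred on each class region, a test point is assigned to the class whose kernels dominate the weighted sum at that point.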
PSO for optimising RBF parameters
Denote by u = [μ^T σ^T]^T the 2m-dimensional parameter vector that contains the RBF centre vector μ and the kernel width vector σ. Then, as defined in the previous section, the problem of determining the nth RBF kernel's parameters at the nth OFS stage is to solve the optimisation problem of minimising the LOO misclassification rate over u within a 2m-dimensional search space. Specifically, the search space for μ is specified by the distribution of the training data {x_k}.
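A minimal PSO loop of the kind used to search such a box-constrained parameter space can be sketched as follows. This is a generic sketch, not the exact algorithm of [39], [40]: the inertia and acceleration constants are common illustrative choices, a generic cost function stands in for the LOO misclassification rate, and all names are hypothetical.

```python
import numpy as np

def pso_minimise(cost, lower, upper, n_particles=20, n_iter=100, seed=0,
                 inertia=0.72, c1=1.49, c2=1.49):
    """Minimise `cost` over the box [lower, upper] with a basic PSO.

    Standard inertia/cognitive/social velocity update; particle positions are
    clipped to the search space, mirroring the box constraints on the kernel
    parameter vector u.
    """
    rng = np.random.default_rng(seed)
    dim = len(lower)
    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                                # personal best positions
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[np.argmin(pbest_cost)].copy()       # global best position
    gbest_cost = pbest_cost.min()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lower, upper)        # enforce the search box
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
        if pbest_cost.min() < gbest_cost:
            gbest_cost = pbest_cost.min()
            gbest = pbest[np.argmin(pbest_cost)].copy()
    return gbest, gbest_cost
```

In the OFS setting, such a search would be invoked once per kernel, with the cost function evaluating the LOO misclassification rate of the partially constructed classifier.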
Combined SMOTE and PSO optimised RBF for imbalanced classification
The SMOTE [21] over-samples the positive class by creating synthetic instances, the amount of which is a specified over-sampling ratio of the original minority data size. Based on each minority data sample, synthetic data points are generated by randomly selecting points on the lines linking the sample with some of its K nearest neighbours, where K is predetermined. Depending on the required SMOTE amount, one of the K nearest positive-class data samples is randomly selected, possibly several times.
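The generation step just described can be sketched in Python as follows. This is a simplified illustration of SMOTE rather than the reference implementation of [21]; the function name and arguments are hypothetical:

```python
import numpy as np

def smote(X_pos, ratio, k=5, seed=0):
    """Generate synthetic minority samples in the style of SMOTE (sketch).

    X_pos : (N+, m) minority-class feature vectors.
    ratio : over-sampling amount as a multiple of N+ (e.g. 2.0 -> 2*N+ points).
    Each synthetic point is drawn on the line segment between a minority
    sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    n = len(X_pos)
    n_new = int(round(ratio * n))
    # pairwise distances within the minority class
    d = np.linalg.norm(X_pos[:, None, :] - X_pos[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # exclude each sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours
    synthetic = np.empty((n_new, X_pos.shape[1]))
    for i in range(n_new):
        base = i % n                           # cycle through minority samples
        nb = rng.choice(neighbours[base])      # pick one of its k neighbours
        gap = rng.random()                     # random point on the segment
        synthetic[i] = X_pos[base] + gap * (X_pos[nb] - X_pos[base])
    return synthetic
```

Because every synthetic point is a convex combination of two existing positive instances, the new samples interpolate within, rather than extrapolate beyond, the minority-class region.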
Experimental results
The effectiveness of the proposed SMOTE+PSO-OFS algorithm was investigated using a simulated imbalanced data set and three real imbalanced data sets. The first two real data sets were taken from [54], while the third real data set was from [55]. These three real data sets were chosen in the order of increasing imbalance. For each data set, the positive class was over-sampled at different rates of its original size using the SMOTE. For the synthetic data set, a separate test data set was
Conclusions
The RBF classifier performs well on balanced or slightly imbalanced data sets, and our previous work has provided an efficient and tunable RBF classifier optimised by the PSO based on the OFS procedure. For highly imbalanced data sets, however, the performance of the tunable RBF classifier may no longer be satisfactory. In order to combat challenging imbalanced classification problems, many approaches have been proposed, which aim to reduce the influence from the underlying imbalanced
Acknowledgements
This work was supported by the UK EPSRC. The authors would like to thank Dr. P.A.S. Reed and Dr. K.K. Lee for their help with the ADI data set.
References (57)
- et al., Learning from imbalanced data in surveillance of nosocomial infection, Artif. Intell. Med. (2006)
- et al., Strategies for learning in class imbalance problems, Pattern Recognition (2003)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition (1997)
- et al., The class imbalance problem: a systematic study, Intell. Data Anal. (2002)
- Mining with rarity: a unifying framework, ACM SIGKDD Explor. Newsl. (2004)
- Concept-learning in the presence of between-class and within-class imbalances
- et al., Data mining for improved cardiac care, ACM SIGKDD Explor. Newsl. (2006)
- et al., Predicting protein–protein interactions in unbalanced data using the primary structure of proteins, BMC Bioinformatics (2010)
- F. Provost, T. Fawcett, R. Kohavi, The case against accuracy estimation for comparing induction algorithms, in: ...
- et al., Adaptive fraud detection, Data Min. Knowl. Discovery (1997)
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Automatically countering imbalance and its empirical relationship to cost, Data Min. Knowl. Discovery
- Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs?
- Learning when training data are costly: the effect of class distribution on tree induction, Artif. Intell. Res.
- A multiple resampling method for learning from imbalanced data sets, J. Chem. Inf. Modeling
- A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor. Newsl.
- SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.
- An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data Eng.
- A kernel-based two-class classifier for imbalanced data sets, IEEE Trans. Neural Networks
- Fast learning in networks of locally-tuned processing units, Neural Comput.
Ming Gao received the MEng degree from Northwestern Polytechnical University (NPU), Shaan'xi, P.R. China, in 2006, and the MEng degree from the Beihang University, Beijing, P.R. China, in 2009. He is now working towards the PhD degree in the Systems Engineering School, the University of Reading (UoR), Reading, UK. His research interests are machine learning, pattern recognition, and their applications in imbalanced problems.
Xia Hong received her university education at National University of Defense Technology, P.R. China (BSc, 1984; MSc, 1987), and University of Sheffield, UK (PhD, 1998), all in automatic control. She worked as a Research Assistant in Beijing Institute of Systems Engineering, Beijing, China from 1987 to 1993. She worked as a Research Fellow in the Department of Electronics and Computer Science at University of Southampton from 1997 to 2001. She is currently a Reader at School of Systems Engineering, University of Reading. She is actively engaged in research into nonlinear systems identification, data modelling, estimation and intelligent control, neural networks, pattern recognition, learning theory and their applications. She has published over 100 research papers, and coauthored a research book. She was awarded a Donald Julius Groen Prize by IMechE in 1999.
Sheng Chen received his PhD degree in control engineering from the City University, London, UK, in September 1986. He was awarded the Doctor of Sciences (DSc) degree by the University of Southampton, Southampton, UK, in 2005. From October 1986 to August 1999, he conducted research and academic appointments at the University of Sheffield, the University of Edinburgh and the University of Ports-mouth, all in UK. Since September 1999, he has been with the School of Electronics and Computer Science, University of Southampton, UK. Professor Chen's research interests include wireless communications, adaptive signal processing for communications, machine learning, and evolutionary computation methods. He has published over 400 research papers. Dr. Chen is a Fellow of IET and a Fellow of IEEE. In the database of the world's most highly cited researchers, compiled by Institute for Scientific Information (ISI) of the USA, Dr. Chen is on the list of the highly cited researchers (2004) in the engineering category.
Chris Harris received university education at Leicester (BSc), Oxford (MA) and Southampton (PhD). He previously conducted appointments at the Universities of Hull, UMIST, Oxford and Cranfield, as well as being employed by the UK Ministry of Defence. His research interests are in the area of intelligent and adaptive systems theory and its application to intelligent autonomous systems, management infrastructures, intelligent control and estimation of dynamic processes, multi-sensor data fusion and systems integration. He has authored or coauthored 12 books and over 400 research papers, and he was the associate editor of numerous international journals including Automatica, Engineering Applications of AI, International Journal of General Systems Engineering, International Journal of System Science and the International Journal on Mathematical Control and Information Theory. He was elected to the Royal Academy of Engineering in 1996, was awarded the IEE Senior Achievement medal in 1998 for his work on autonomous systems, and the highest international award in IEE, the IEE Faraday medal in 2001 for his work in Intelligent Control and Neurofuzzy System.