
Neurocomputing

Volume 74, Issue 17, October 2011, Pages 3456-3466

A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems

https://doi.org/10.1016/j.neucom.2011.06.010

Abstract

This contribution proposes a powerful technique for two-class imbalanced classification problems by combining the synthetic minority over-sampling technique (SMOTE) and the particle swarm optimisation (PSO) aided radial basis function (RBF) classifier. In order to enhance the significance of the small and specific region belonging to the positive class in the decision space, the SMOTE is applied to generate synthetic instances for the positive class so as to balance the training data set. Based on the over-sampled training data, the RBF classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier's structure and the parameters of the RBF kernels are determined by a PSO algorithm that minimises the leave-one-out misclassification rate. Experimental results obtained on a simulated imbalanced data set and three real imbalanced data sets demonstrate the effectiveness of the proposed algorithm.

Introduction

A classification problem is referred to as imbalanced when the instances in one or several classes, known as the majority classes, outnumber the instances of the other classes, called the minority classes. Such an imbalance in the data represents the so-called between-class imbalance [1], in contrast to the related issue of within-class imbalance [2], [3]. Imbalanced problems exist widely in medical diagnosis, science and engineering; examples include surveillance of nosocomial infection [4], cardiac care [5] and elucidating protein–protein interactions [6], as well as fraud detection [7], [8], network intrusion detection [9] and telecommunication management [10]. Note that, in an imbalanced problem, the minority classes are usually the more important ones. For instance, only 11% of patients suffer from one or more nosocomial infections [4], yet it is precisely these minority cases that surveillance aims to detect. In the study of two-class imbalanced problems, the instances in the majority class are referred to as negative, while those in the minority class are referred to as positive. Since in practice the minority class is the more important one, one should be more concerned with the positive instances. Imbalanced data learning has been widely researched [11], [12], [13], [14], [15], [16]. Typically, the approaches for solving the imbalanced problem can be divided into two categories: re-sampling methods and imbalanced learning algorithms.

The re-sampling approach is a re-balancing process applied to the given imbalanced data set. Studies [17], [18] on class distribution have shown that balanced data sets yield better classification performance than imbalanced ones, although other studies [1], [19] have argued that imbalanced data sets are not necessarily responsible for the poor performance of some classifiers. Re-sampling techniques are attractive under most imbalanced circumstances because re-sampling adjusts only the original training data set, instead of modifying the learning algorithm. Thus, this approach is external and transportable [18], [20], and it provides a convenient and effective way to deal with imbalanced learning problems using standard classifiers. Specifically, the re-sampling methods include random over-sampling, which randomly appends replicated instances to the positive class, and random under-sampling, which randomly removes instances from the majority class. Alternatively, there exist guided over-sampling and under-sampling, in which the choices of which instances to replicate or to eliminate are informed rather than random. In addition, the synthetic minority over-sampling technique (SMOTE) [21] is a widely acknowledged over-sampling method. In the SMOTE, instead of merely duplicating existing data, the positive class is over-sampled by creating synthetic instances in the feature space between the positive instances and their K nearest neighbours.
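As a minimal illustration of the two random re-sampling baselines just described, the following sketch uses NumPy; the function names and interfaces are ours for exposition, not taken from the paper.

```python
import numpy as np

def random_oversample(X_pos, n_extra, rng):
    """Random over-sampling: append n_extra replicated positive instances."""
    idx = rng.integers(0, len(X_pos), size=n_extra)
    return np.vstack([X_pos, X_pos[idx]])

def random_undersample(X_neg, n_keep, rng):
    """Random under-sampling: randomly retain only n_keep negative instances."""
    idx = rng.choice(len(X_neg), size=n_keep, replace=False)
    return X_neg[idx]

# usage: balance a toy data set with 20 positive and 200 negative samples
rng = np.random.default_rng(0)
X_pos, X_neg = rng.normal(size=(20, 2)), rng.normal(size=(200, 2))
X_pos_bal = random_oversample(X_pos, n_extra=180, rng=rng)   # 200 positives
X_neg_bal = random_undersample(X_neg, n_keep=20, rng=rng)    # 20 negatives
```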

The second category, consisting of imbalanced learning algorithms, can be regarded as modifying or re-balancing existing learning algorithms so that they can deal with imbalanced problems effectively. The imbalanced learning algorithms include the cost-sensitive methods [22], [23], [24], [25] and the discrimination-based and recognition-based approaches [3]. An alternative is to adapt standard kernel-based or radial basis function (RBF) classifiers, which use a fixed common variance for every RBF kernel and choose the RBF centres from the input data, to imbalanced data sets by modifying the kernel construction and model selection procedure. A representative work in this category [26] proposes a regularised weighted least square estimator (LSE) using the orthogonal forward selection (OFS) based on the model selection criterion of maximising the leave-one-out (LOO) area under the curve (AUC) of the receiver operating characteristics (ROC). In this LOO-AUC+OFS algorithm [26], the cost function of the LSE is made sensitive to the class labels, such that the errors due to minority-class data samples are given a higher weight ρ > 1, and this weighted LSE (WLSE) reduces to the standard LSE when the weight ρ = 1. A well-known RBF modelling method is the two-stage procedure [27], in which the RBF centres are first determined using κ-means clustering [28] and the RBF weights are then obtained using the LSE. To cope with imbalanced data sets, a natural extension of [27] is to replace the latter stage with the WLSE, using the same weighted cost function as [26]. This κ-means+WLSE algorithm provides a viable alternative within this imbalanced learning category.
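The class-weighted estimator admits a closed-form solve. Below is a minimal sketch of a WLSE computation for the RBF weight vector, assuming a ridge-stabilised normal-equation solve; the small regularisation term lam and all names are our own illustration, not the exact formulation of [26].

```python
import numpy as np

def weighted_lse(G, y, rho=2.0, lam=1e-6):
    """Class-weighted regularised least squares for the RBF weight vector.

    G   : (N, M) design matrix of kernel responses g_i(x_k)
    y   : (N,) class labels in {-1, +1}
    rho : weight > 1 placed on minority-class (positive) errors;
          rho = 1 recovers the standard LSE
    lam : small ridge term for numerical stability (our assumption)
    """
    p = np.where(y > 0, rho, 1.0)     # per-sample error weights
    GtP = G.T * p                     # G^T P without forming diag(p)
    A = GtP @ G + lam * np.eye(G.shape[1])
    return np.linalg.solve(A, GtP @ y)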

Kernel-based learning, such as the support vector machine (SVM) and the RBF network, is widely used for solving balanced learning problems. In particular, a powerful approach for constructing the RBF and other sparse kernel classifiers is to assign a fixed common variance to every kernel and to select input data as the candidate RBF centres by minimising the leave-one-out (LOO) misclassification rate in the efficient OFS procedure [29]. This approach has its roots in regression applications [30], [31], [32], [33]. Two limitations may be associated with this "fixed" RBF kernel approach. Firstly, the RBF kernels cannot be flexibly tuned, as the position of each kernel is restricted to the input data and the shape of each kernel is fixed rather than determined by the learning procedure. Secondly, the common kernel variance has to be determined via cross validation, which inevitably increases the computational cost. Previous studies [34], [35], [36] have proposed to construct a tunable RBF classifier based on the OFS procedure, using a global search optimisation algorithm [37] to optimise the RBF kernels one by one. Compared with the standard fixed kernel approach, this tunable kernel approach is observed to produce sparser classifiers with better performance, albeit at higher computational complexity during classifier construction. Recently, the particle swarm optimisation (PSO) algorithm [38] has been adopted to minimise the LOO misclassification rate in the OFS construction of the tunable RBF classifier [39], [40]. PSO [38] is an efficient population-based stochastic optimisation technique inspired by the social behaviour of bird flocks and fish schools, and it has been successfully applied to wide-ranging optimisation applications [41], [42], [43], [44], [45], [46]. Owing to the efficiency of PSO, the tunable RBF modelling approach advocated in [39], [40] offers significant advantages in terms of better generalisation performance and smaller classifier size, as well as lower complexity of the learning process, compared with the standard fixed kernel approach. This PSO aided tunable RBF classifier represents the state of the art for balanced data sets.

Although the study [1] has shown that kernel-based methods provide relatively robust classification for imbalanced problems, a highly imbalanced data set can still seriously degrade the generalisation performance of kernel-based classifiers. In order to achieve better classification performance on highly imbalanced data, an effective approach is to integrate kernel-based classifiers with re-sampling methods. Previous studies [47], [48], [49] mainly focused on SVMs. Specifically, the method of [47] combined the SMOTE with different misclassification costs to bias the SVM, so as to shift the decision boundary away from the positive instances and to define a better boundary. The work [48] proposed ensemble systems that re-sample the data sets to form the inputs to standard SVM classifiers, while the method of [49] introduced asymmetric misclassification costs into the SVM to improve classification performance. Another integration of the SVM with under-sampling combined the granular support vector machine (GSVM) [50] with repetitive under-sampling (RU) to form the GSVM–RU algorithm [51].

Against this background, this contribution proposes an effective alternative for two-class imbalanced classification problems by combining the SMOTE algorithm [21] and the PSO aided RBF classifier [39], [40]. Specifically, the SMOTE is first applied to generate synthetic instances in the positive class to balance the training data set. Using the resulting balanced data set, the tunable RBF classifier is then constructed by applying the PSO to minimise the LOO misclassification rate within the computationally efficient OFS procedure; a high-level sketch of this pipeline is given below. In the experimental study, involving a simulated imbalanced data set and three real imbalanced data sets, three benchmarks are compared with the proposed SMOTE+PSO-OFS method. The first benchmark combines the SMOTE [21] and the K̄ nearest neighbour (K̄-NN) classifier [52], and is denoted SMOTE+K̄-NN. The K̄-NN classifier is a widely used classification method, and this combined SMOTE and K̄-NN represents a typical method of the re-sampling approach for imbalanced problems. The second benchmark is the algorithm advocated in [26], denoted LOO-AUC+OFS, which is a state-of-the-art representative of the second approach for dealing with imbalanced problems. The third benchmark, the κ-means+WLSE algorithm discussed previously, is also a typical method of the imbalanced learning approach. The experimental results obtained demonstrate that the proposed method is competitive with these existing state-of-the-art methods for two-class imbalanced problems.
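The overall flow of the proposed method can be summarised in a few lines. The sketch below is only illustrative glue: smote_balance and pso_ofs_rbf are hypothetical names standing in for the procedures detailed in Sections 2–4 (concrete sketches of both appear later in this article).

```python
def smote_pso_ofs(X, y, beta=100, K=5):
    """High-level flow of the proposed SMOTE+PSO-OFS method (illustrative).

    1. Balance: over-sample the positive class with SMOTE at rate beta%
       so the training set becomes (approximately) balanced.
    2. Construct: grow the RBF classifier kernel by kernel via OFS, with
       PSO tuning each kernel's centre and covariance to minimise the
       LOO misclassification rate.
    """
    X_bal, y_bal = smote_balance(X, y, beta=beta, K=K)   # step 1 (Section 4)
    model = pso_ofs_rbf(X_bal, y_bal)                    # step 2 (Sections 2-3)
    return model
```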

The rest of the paper is organised as follows. Section 2 introduces the tunable RBF model for two-class classification and the OFS procedure based on the LOO misclassification rate, while Section 3 presents the PSO algorithm for tuning the RBF kernels by minimising the LOO misclassification rate. Section 4 introduces the SMOTE method and presents the proposed combined SMOTE and PSO based RBF algorithm. The effectiveness of our approach is demonstrated by numerical examples in Section 5, and our conclusions are given in Section 6.


RBF classifier for two-class problems

Consider the two-class data set $D_N = \{\mathbf{x}_k, y_k\}_{k=1}^{N}$ that contains $N$ data instances, where $y_k \in \{\pm 1\}$ denotes the class label of the feature vector $\mathbf{x}_k \in \mathbb{R}^m$, while there are $N_+$ positive instances and $N_-$ negative instances, with $N = N_+ + N_-$. We use the data set $D_N$ to construct the RBF classifier of the form
$$\hat{y}_k^{(M)} = \sum_{i=1}^{M} w_i\, g_i(\mathbf{x}_k) = \mathbf{g}_M^{\mathrm T}(k)\,\mathbf{w}_M, \qquad \tilde{y}_k^{(M)} = \operatorname{sgn}\bigl(\hat{y}_k^{(M)}\bigr),$$
where $M$ is the number of RBF kernels, $\hat{y}_k^{(M)}$ is the output of the $M$-term classifier with the $M$ kernels $g_i(\cdot)$, $1 \le i \le M$, and $\mathbf{w}_M = [w_1\, w_2\, \cdots\, w_M]^{\mathrm T}$ is the weight vector
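To make the model concrete, the following sketch evaluates an $M$-term RBF classifier, assuming Gaussian kernels with tunable centres and diagonal covariances, which is consistent with the tunable RBF kernels of [39], [40]; the array layout and function name are our own.

```python
import numpy as np

def rbf_classify(X, centres, inv_var, w):
    """Evaluate y_tilde = sgn(sum_i w_i g_i(x)) for an M-term RBF classifier.

    X        : (N, m) feature vectors x_k
    centres  : (M, m) kernel centres c_i
    inv_var  : (M, m) inverse diagonal covariances of each kernel
               (Gaussian kernels with diagonal covariance are an
               assumption consistent with the tunable RBF of [39], [40])
    w        : (M,) kernel weights w_i
    """
    d = X[:, None, :] - centres[None, :, :]          # (N, M, m) differences
    g = np.exp(-np.sum(d * d * inv_var, axis=2))     # (N, M) kernel responses
    y_hat = g @ w                                    # classifier output
    return np.sign(y_hat)                            # class decision +/-1
```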

PSO for optimising RBF parameters

Denote $\boldsymbol{\mu} = [\mu(1)\, \mu(2)\, \cdots\, \mu(2m)]^{\mathrm T}$ as the $2m$-dimensional parameter vector that contains $\mathbf{c}_n$ and $\boldsymbol{\Sigma}_n$. Then, as defined in the previous section, the problem of determining the $n$th RBF kernel's parameters at the $n$th OFS stage is to solve the optimisation problem
$$\hat{\boldsymbol{\mu}} = \arg\min_{\boldsymbol{\mu} \in \Gamma} J_{\mathrm{LOO}}^{(n)}(\boldsymbol{\mu}),$$
where the $2m$-dimensional search space $\Gamma$ is defined by
$$\Gamma = \prod_{i=1}^{2m} \bigl[\Gamma_{i,\min},\, \Gamma_{i,\max}\bigr].$$
Specifically, the search space for $\mathbf{c}_n = [c_{n,1}\, c_{n,2}\, \cdots\, c_{n,m}]^{\mathrm T}$ is specified by the distribution of the training data $\{\mathbf{x}_k = [x_{k,1}\, x_{k,2}\, \cdots\, x_{k,m}]^{\mathrm T}\}_{k=1}^{N}$, namely,
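A generic PSO that minimises a cost over such a box-shaped search space can be sketched as follows; here J stands for the cost $J_{\mathrm{LOO}}^{(n)}(\cdot)$, while the swarm size, iteration count and inertia/acceleration constants are common textbook defaults, not the specific settings of this paper.

```python
import numpy as np

def pso_minimise(J, lo, hi, n_particles=20, n_iter=100, seed=0):
    """Minimal PSO sketch for min_{mu in Gamma} J(mu) over the box [lo, hi]."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    d = lo.size
    x = rng.uniform(lo, hi, size=(n_particles, d))   # particle positions mu
    v = np.zeros((n_particles, d))                   # particle velocities
    pbest = x.copy()                                 # personal bests
    pcost = np.array([J(p) for p in x])
    g = pbest[np.argmin(pcost)]                      # global best
    w_in, c1, c2 = 0.7, 1.5, 1.5                     # textbook constants
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, d))
        v = w_in * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)                   # stay inside Gamma
        cost = np.array([J(p) for p in x])
        better = cost < pcost
        pbest[better], pcost[better] = x[better], cost[better]
        g = pbest[np.argmin(pcost)]
    return g
```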

Combined SMOTE and PSO optimised RBF for imbalanced classification

The SMOTE [21] over-samples the positive class by creating synthetic instances at a specified over-sampling ratio, β%, of the original minority data size. For each minority data sample, denoted $\mathbf{x}_o$, β% synthetic data points are generated by randomly selecting points on the line segments linking $\mathbf{x}_o$ with some of its K nearest neighbours, where K is predetermined. Depending on the required SMOTE amount β%, one of the K nearest positive-class data samples may be randomly selected several times.
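A minimal sketch of this procedure for β ≥ 100 is given below (rates below 100% instead use a random subset of the minority samples, which this sketch omits); the NumPy interface and names are our own.

```python
import numpy as np

def smote(X_pos, beta=100, K=5, seed=0):
    """Generate beta% synthetic positive instances (a sketch of SMOTE [21]).

    For each minority sample x_o, create round(beta/100) synthetic points
    on the line segments joining x_o to randomly chosen members of its
    K nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    n_new = max(1, round(beta / 100))                # synthetics per sample
    # pairwise squared distances within the minority class
    d2 = np.sum((X_pos[:, None] - X_pos[None, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)                     # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :K]               # K nearest neighbours
    synth = []
    for i, xo in enumerate(X_pos):
        for _ in range(n_new):
            xn = X_pos[rng.choice(nn[i])]            # one of the K neighbours
            synth.append(xo + rng.random() * (xn - xo))  # point on the line
    return np.vstack(synth)
```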

Experimental results

The effectiveness of the proposed SMOTE+PSO-OFS algorithm was investigated using a simulated imbalanced data set and three real imbalanced data sets. The first two real data sets were taken from [54], while the third real data set was from [55]. These three real data sets were chosen in order of increasing imbalance. For each data set, the positive class was over-sampled at different rates β% of its original size using the SMOTE. For the synthetic data set, a separate test data set was
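When judging classifiers on such data, overall accuracy alone can mask poor minority-class performance, so per-class rates are usually examined; the sketch below computes these. It illustrates common imbalanced-learning practice and is not necessarily the exact set of metrics reported in the paper's tables.

```python
import numpy as np

def class_rates(y_true, y_pred):
    """Per-class accuracies for a two-class problem with labels +/-1.

    Returns the positive-class (minority) and negative-class (majority)
    accuracies separately; reporting both avoids the pitfall of a
    classifier scoring highly by predicting only the majority class.
    """
    pos, neg = y_true > 0, y_true < 0
    tpr = np.mean(y_pred[pos] == 1)                  # minority-class accuracy
    tnr = np.mean(y_pred[neg] == -1)                 # majority-class accuracy
    return tpr, tnr
```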

Conclusions

The RBF classifier performs well on balanced or slightly imbalanced data sets, and our previous work has provided an efficient and tunable RBF classifier optimised by the PSO based on the OFS procedure. For highly imbalanced data sets, however, the performance of the tunable RBF classifier may no longer be satisfactory. In order to combat challenging imbalanced classification problems, many approaches have been proposed, which aim to reduce the influence of the underlying imbalanced

Acknowledgements

This work was supported by the UK EPSRC. The authors would like to thank Dr. P.A.S. Reed and Dr. K.K. Lee for their help with the ADI data set.


References (57)

  • D.A. Cieslak, N.V. Chawla, A. Striegel, Combating imbalance in network intrusion datasets, in: Proceedings of the 2006...
  • G.M. Weiss, H. Hirsh, Learning to predict rare events in event sequences, in: Proceedings of the Fourth IEEE...
  • F. Provost, Machine Learning from Imbalanced Data Sets 101, AAAI Workshop on Learning from Imbalanced Data Sets,...
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • V. García, J.S. Sánchez, R.A. Mollineda, R. Alejo, J.M. Sotoca, The Class Imbalance Problem in Pattern Classification...
  • N.V. Chawla et al., Automatically countering imbalance and its empirical relationship to cost, Data Min. Knowl. Discovery (2008)
  • G.M. Weiss et al., Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs?
  • G.M. Weiss et al., Learning when training data are costly: the effect of class distribution on tree induction, J. Artif. Intell. Res. (2003)
  • A. Estabrooks et al., A multiple resampling method for learning from imbalanced data sets, Comput. Intell. (2004)
  • G.E.A.P.A. Batista et al., A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor. Newsl. (2004)
  • C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: 2003...
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th International Joint Conference on...
  • K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data Eng. (2002)
  • M.A. Maloof, Learning when data sets are imbalanced and when costs are unequal and unknown, in: 2003 International...
  • K. McCarthy, B. Zabar, G. Weiss, Does cost-sensitive learning beat sampling for classifying rare classes? in:...
  • X. Hong et al., A kernel-based two-class classifier for imbalanced data sets, IEEE Trans. Neural Networks (2007)
  • J. Moody et al., Fast learning in networks of locally-tuned processing units, Neural Comput. (1989)

Ming Gao received the MEng degree from Northwestern Polytechnical University (NPU), Shaanxi, P.R. China, in 2006, and the MEng degree from Beihang University, Beijing, P.R. China, in 2009. He is now working towards the PhD degree in the School of Systems Engineering, University of Reading (UoR), Reading, UK. His research interests are machine learning, pattern recognition, and their applications to imbalanced problems.

Xia Hong received her university education at the National University of Defense Technology, P.R. China (BSc, 1984; MSc, 1987), and the University of Sheffield, UK (PhD, 1998), all in automatic control. She worked as a Research Assistant at the Beijing Institute of Systems Engineering, Beijing, China, from 1987 to 1993, and as a Research Fellow in the Department of Electronics and Computer Science at the University of Southampton from 1997 to 2001. She is currently a Reader at the School of Systems Engineering, University of Reading. She is actively engaged in research into nonlinear systems identification, data modelling, estimation and intelligent control, neural networks, pattern recognition, learning theory and their applications. She has published over 100 research papers and coauthored a research book. She was awarded the Donald Julius Groen Prize by the IMechE in 1999.

Sheng Chen received his PhD degree in control engineering from the City University, London, UK, in September 1986. He was awarded the Doctor of Sciences (DSc) degree by the University of Southampton, Southampton, UK, in 2005. From October 1986 to August 1999, he held research and academic appointments at the University of Sheffield, the University of Edinburgh and the University of Portsmouth, all in the UK. Since September 1999, he has been with the School of Electronics and Computer Science, University of Southampton, UK. Professor Chen's research interests include wireless communications, adaptive signal processing for communications, machine learning, and evolutionary computation methods. He has published over 400 research papers. Dr. Chen is a Fellow of the IET and a Fellow of the IEEE. In the database of the world's most highly cited researchers, compiled by the Institute for Scientific Information (ISI) of the USA, Dr. Chen is on the list of highly cited researchers (2004) in the engineering category.

Chris Harris received his university education at Leicester (BSc), Oxford (MA) and Southampton (PhD). He previously held appointments at the Universities of Hull, UMIST, Oxford and Cranfield, as well as being employed by the UK Ministry of Defence. His research interests are in the area of intelligent and adaptive systems theory and its application to intelligent autonomous systems, management infrastructures, intelligent control and estimation of dynamic processes, multi-sensor data fusion and systems integration. He has authored or coauthored 12 books and over 400 research papers, and he has been an associate editor of numerous international journals, including Automatica, Engineering Applications of AI, the International Journal of General Systems Engineering, the International Journal of System Science and the International Journal on Mathematical Control and Information Theory. He was elected to the Royal Academy of Engineering in 1996, was awarded the IEE Senior Achievement Medal in 1998 for his work on autonomous systems, and received the IEE's highest international award, the IEE Faraday Medal, in 2001 for his work in intelligent control and neurofuzzy systems.
