Neurocomputing

Volume 138, 22 August 2014, Pages 248-259

PDFOS: PDF estimation based over-sampling for imbalanced two-class problems

https://doi.org/10.1016/j.neucom.2014.02.006

Abstract

This contribution proposes a novel probability density function (PDF) estimation based over-sampling (PDFOS) approach for two-class imbalanced classification problems. The classical Parzen-window kernel function is adopted to estimate the PDF of the positive class, and synthetic instances are then generated according to the estimated PDF as additional training data. The essential idea is to re-balance the class distribution of the original imbalanced data set under the principle that the synthetic data samples follow the same statistical properties as the observed positive-class data. Based on the over-sampled training data, the radial basis function (RBF) classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier's structure and the parameters of the RBF kernels are determined by a particle swarm optimisation algorithm that minimises the leave-one-out misclassification rate. The effectiveness of the proposed PDFOS approach is demonstrated by an empirical study on several imbalanced data sets.

Introduction

In a typical two-class imbalanced classification problem, the instances of one class outnumber those of the other. The majority class is usually referred to as the negative class, and the minority one as the positive class. Machine learning from imbalanced data, in which the skewed class distribution causes the positive-class instances to be submerged in the negative class, is of great interest. The problem typically arises in life-threatening or safety-critical applications, such as mammography for breast cancer detection [1], mobile phone fraud detection [2], and detection of oil spills in satellite radar images [3]. In addition, many engineering applications, including information retrieval and filtering [4], direct marketing [5], and risk management [6], are inherently imbalanced. In these applications, the primary objective is often to target and explore the rare cases/classes which are less probable yet highly risky or costly. The imbalance between the two classes is problematic for many standard classification algorithms [7], [8], [9], [10], [11]. The performance of these algorithms deteriorates as the degree of class imbalance increases, or as the data samples of the positive class become sparser [9]. For example, kernel-based methods, which are regarded as robust classifiers [12], construct a decision hyperplane separating the two classes. Without special countermeasures, the resultant hyperplane tends to be placed in favour of the classification performance on the negative class, while the classification performance on the target class becomes unsatisfactory. A large body of work exists on imbalanced learning, and the reader is referred to the excellent survey paper [12] for more information. Typical techniques for tackling the imbalanced problem fall into two categories: resampling methods, also known as external methods, and imbalanced learning algorithms, often referred to as internal methods.

Imbalanced learning algorithms are obtained by modifying existing learning algorithms internally so that they can deal with imbalanced problems effectively, without ‘artificially’ altering or re-balancing the original imbalanced data set. For example, the kernel classifier construction or model selection procedure can be modified to cope with the imbalanced distribution during the classifier construction process [11], [13]. A well-known radial basis function (RBF) modelling approach is the two-stage procedure [14], in which the RBF centres are first determined using κ-means clustering [15] and the RBF weights are then obtained using the least squares estimate (LSE). To cope with imbalanced data sets, a natural extension of [14] is to replace the latter stage with the weighted LSE (WLSE), using the same weighted cost function as [13]. This κ-means+WLSE algorithm provides a viable technique in this category of imbalanced learning; a minimal sketch of the idea is given below.
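
To make the two-stage idea concrete, the following minimal sketch (Python/NumPy; the function names, the Gaussian width and the weighting value are illustrative assumptions, not the authors' implementation) places the centres by κ-means and then solves the weighted least-squares problem for the output weights:

```python
import numpy as np

def simple_kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means (random initialisation) to place the RBF centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres

def kmeans_wlse_rbf(X, y, n_centres=10, width=1.0, pos_weight=10.0):
    """Two-stage RBF fit: k-means for the centres, weighted LSE for the
    output weights. y takes values in {-1, +1}; pos_weight up-weights the
    positive (minority) class in the cost, in the spirit of [13]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    centres = simple_kmeans(X, n_centres)
    # Gaussian basis matrix: Phi[k, j] = exp(-||x_k - c_j||^2 / (2 width^2))
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * width ** 2))
    w = np.where(y > 0, pos_weight, 1.0)          # per-sample cost weights
    # WLSE: theta = (Phi^T W Phi)^{-1} Phi^T W y, solved via lstsq
    theta, *_ = np.linalg.lstsq(Phi.T @ (Phi * w[:, None]),
                                Phi.T @ (w * y), rcond=None)
    return centres, theta
```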

The resampling methods are external in that they operate on the original imbalanced data set, aiming to provide a re-balanced input for training a conventional classifier. One scheme is to assign different weights to the samples of the data set in accordance with their misclassification costs [16], [17]. A large number of studies have focused on this simple yet effective methodology in combination with conventional classifiers on the rebalanced data set. Clearly, the ultimate classification performance depends on the adopted resampling strategy as well as on the choice of classifier. In terms of classifier development, the particle swarm optimisation (PSO) algorithm [18] has recently been applied to minimise the leave-one-out (LOO) misclassification rate in the orthogonal forward selection (OFS) construction of the tunable RBF classifier [19], [20]. PSO [18] is an efficient population-based stochastic optimisation technique inspired by the social behaviour of bird flocks and fish schools, and it has been successfully applied to wide-ranging optimisation applications [21], [22], [23], [24], [25], [26], [27], [28]. Owing to the efficiency of PSO, the tunable RBF modelling approach advocated in [19], [20] offers significant advantages over many existing kernel or RBF classifier construction algorithms, in terms of better generalisation performance and smaller classifier size as well as lower complexity of the learning process. With regard to the choice of resampling strategy, we note that the various resampling methods can be divided into two basic categories, according to whether they re-balance the class distribution by under-sampling or by over-sampling.
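
As background for the PSO component, the standard PSO velocity/position update can be sketched as follows (a generic minimiser with common parameter values, not the tuned scheme of [19], [20]; in the paper, the cost would be the LOO misclassification rate of a candidate RBF unit):

```python
import numpy as np

def pso_minimise(cost, dim, n_particles=20, n_iter=100,
                 w=0.72, c1=1.49, c2=1.49, bounds=(-1.0, 1.0), seed=0):
    """Generic particle swarm minimiser with standard inertia/acceleration
    values; returns the best position found."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))       # particle positions
    v = np.zeros((n_particles, dim))                  # particle velocities
    pbest = x.copy()                                  # personal bests
    pbest_cost = np.array([cost(p) for p in x])
    g = pbest[pbest_cost.argmin()].copy()             # global best
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        # velocity update: inertia + cognitive + social terms
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        c = np.array([cost(p) for p in x])
        better = c < pbest_cost
        pbest[better], pbest_cost[better] = x[better], c[better]
        g = pbest[pbest_cost.argmin()].copy()
    return g
```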

Random under-sampling is a non-heuristic method that re-balances the class distribution by randomly eliminating instances of the negative class [29]. Despite its simplicity, random under-sampling is considered to be one of the most effective re-sampling methods [30]. A major drawback of this technique is that it may discard data potentially important for building the classifier. Thus, many studies focus on heuristic selection techniques [31], [32], [33], [34], [35], [36], [37], [38], [39], [40] for eliminating negative-class instances. The method presented in [35] selectively under-samples the negative class while keeping all the samples of the positive class. Specifically, the negative-class instances are divided into four categories: class-label noise instances A that overlap the positive-class decision region; borderline instances B that are unreliable and can easily cause misclassification; redundant instances C that do not harm classification accuracy but increase classification costs; and safe instances D that are worth keeping for the classification process. The categories A and B are detected using the Tomek links concept [41], as instances participating in Tomek links are either borderline or noisy samples. The SHRINK system [3] labels the overlapping region of the negative and positive classes as positive, and searches for the best positive-class region. Alternatively, Wilson's edited nearest neighbour (ENN) rule [42] has been introduced to eliminate noisy instances in the negative class [43]. The ENN rule removes any instance whose class label differs from the class labels of at least two of its three nearest neighbours (see the sketch below), and the neighbourhood cleaning rule (NCL) [44] modifies the ENN by removing any negative-class instance whose class label differs from that of its three nearest neighbours. In order to find a consistent subset, the categories C and D are identified using Hart's condensed nearest neighbour (CNN) rule [45].
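
For illustration, the ENN rule described above amounts to a few lines of code (a brute-force NumPy sketch; the function name and interface are illustrative):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Wilson's ENN: drop any instance whose label disagrees with the
    majority of its k nearest neighbours (k=3 as in the rule above)."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nn = np.argsort(d)[:k]
        # removed if at least two of its three neighbours disagree
        if np.sum(y[nn] != y[i]) >= (k + 1) // 2:
            keep[i] = False
    return X[keep], y[keep]
```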

Under-sampling tends to be an ideal option when the imbalance degree is not very severe. However, as pointed out in [46], the use of over-sampling becomes necessary when the imbalance degree is high. Random over-sampling is a non-heuristic method that re-balances the class distribution by randomly replicating instances of the positive class. Studies [9], [29] highlight that this method is simple yet very competitive with more complex over-sampling methods. However, over-fitting is a recognised serious problem for random over-sampling, because exact copies of the positive-class instances are made. In a study of imbalanced data sets in marketing analysis, over-sampling the positive instances with replacement was applied to match the number of negative instances [5]. The study [47] proposed the synthetic minority over-sampling technique (SMOTE), which aims to enhance the significance of specific regions in the feature space by over-sampling the positive class. Rather than merely duplicating data, SMOTE generates synthetic instances in the feature space by random interpolation along the line segments linking an instance to its k nearest neighbours (k-NN), as sketched below. Although SMOTE is widely acknowledged by the academic community, it still has some drawbacks, including over-generalisation and large variance [48]. Thus, SMOTEBoost [49], borderline-SMOTE [50] and adaptive synthetic sampling (ADASYN) [51] were proposed to alleviate its limitations. Despite the empirical evidence that the foregoing methods are effective in improving the classification performance for the target class, the reason behind the success of over-sampling approaches such as SMOTE is not fully understood. In fact, there exist few theoretical studies justifying most of the over-sampling methods. This raises the fundamental questions of how to measure the quality of synthetic instances and why they can be used as training samples.
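
To fix ideas, the interpolation step at the heart of SMOTE can be sketched as follows (a minimal NumPy version; parameter names are illustrative and the bookkeeping of the original algorithm is omitted):

```python
import numpy as np

def smote(X_pos, n_synthetic, k=5, seed=0):
    """Generate synthetic positive samples by interpolating between a
    randomly chosen instance and one of its k nearest positive-class
    neighbours."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_synthetic, X_pos.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        d[i] = np.inf
        j = rng.choice(np.argsort(d)[:k])   # one of the k nearest neighbours
        gap = rng.random()                  # random point on the segment
        out[s] = X_pos[i] + gap * (X_pos[j] - X_pos[i])
    return out
```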

Against this background, we propose a novel over-sampling approach based on kernel density estimation from the positive-class data samples. The estimation of the probability density function (PDF) from observed data samples is a fundamental problem in many machine learning and pattern recognition applications [52], [53], [54]. The Parzen window (PW) estimate is a simple yet remarkably accurate nonparametric density estimation technique [53], [54], [55]. According to the estimated PDF, synthetic instances are generated as additional training data. The RBF classifier proposed in [20] is then applied to the rebalanced data set to complete the classification process. In the generic density estimation application, the PW estimator has a well-known drawback: it employs the full data sample set in defining the density estimate for a subsequent observation, and therefore its computational cost at test time scales directly with the sample size. Note that we apply the PW estimator to estimate the distribution of the minority class, which by nature consists of a small number of data samples. Therefore, this potential disadvantage of the PW estimate does not arise in our application. In fact, if the sample size of the positive class is large, there is no need to over-sample it by introducing artificial samples, and the imbalance of the two classes can be better dealt with by removing some samples from the majority class, in other words, by under-sampling the negative class.
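
The generation step of the proposed approach can therefore be sketched as: fit a Gaussian PW density to the positive samples and draw synthetic instances from it, which amounts to picking an observed positive sample at random and perturbing it with kernel-shaped noise. The minimal sketch below assumes an isotropic Gaussian kernel with a user-supplied bandwidth h; the paper's actual kernel form and bandwidth selection are developed in Section 2:

```python
import numpy as np

def pdfos_oversample(X_pos, n_synthetic, h, seed=0):
    """Draw synthetic positives from the Gaussian PW (kernel density)
    estimate: an equal-weight mixture of N(x_i, h^2 I) components centred
    on the observed positive samples (isotropic-kernel assumption)."""
    rng = np.random.default_rng(seed)
    n, m = X_pos.shape
    idx = rng.integers(n, size=n_synthetic)        # pick mixture components
    noise = rng.standard_normal((n_synthetic, m))  # kernel-shaped perturbation
    return X_pos[idx] + h * noise
```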

The significance of our PDFOS+PSO-OFS method is twofold. Firstly, in comparison with existing over-sampling techniques, our PDFOS based over-sampling approach has a much stronger theoretical justification. An ideal or “optimal” over-sampling technique should generate synthetic data according to the same probability distribution that produces the observed positive-class data samples. By using the estimated PDF of the minority class to generate synthetic samples, the synthetic data follow the same statistical properties as the observed positive-class data. Therefore, the proposed PDFOS technique generates synthetic instances of better quality than the existing over-sampling methods. Secondly, the PSO-OFS based RBF classifier, with its structure and parameters determined by a PSO algorithm minimising the LOO misclassification rate within the efficient OFS procedure, has been shown to outperform many existing classifier construction algorithms [20].

To evaluate the proposed PDFOS+PSO-OFS method, an extensive experimental study is carried out, in which three benchmarks are used for comparison purposes. The first benchmark applies the same PSO-OFS based RBF classifier to the SMOTE over-sampled data set [56], denoted SMOTE+PSO-OFS, which offers a very competitive performance against many existing methods for combating two-class imbalanced classification problems, as demonstrated in [56]. The second benchmark is the algorithm advocated in [13], denoted LOO-AUC+OFS, which is a state-of-the-art representative of the internal approach for dealing with imbalanced problems. The third benchmark, the κ-means+WLSE algorithm discussed previously, is also a typical imbalanced learning approach. The experimental results obtained demonstrate that the proposed PDFOS+PSO-OFS method is competitive with these existing state-of-the-art methods for two-class imbalanced problems.

The rest of the paper is organised as follows. Section 2 presents the proposed PDF estimation based over-sampling (PDFOS) algorithm. Section 3 describes our chosen classifier, the PSO aided tunable RBF model for two-class classification constructed by minimising the LOO misclassification rate based on the OFS procedure. The effectiveness of our approach is demonstrated by numerical examples in Section 4, and our conclusions are given in Section 5.


PDF estimation based over-sampling (PDFOS)

Consider the two-class data set given as
$$D_N = \{\mathbf{x}_k, y_k\}_{k=1}^{N} = D_{N_+} \cup D_{N_-}, \qquad D_{N_+} = \{\mathbf{x}_i, y_i = +1\}_{i=1}^{N_+}, \qquad D_{N_-} = \{\mathbf{x}_l, y_l = -1\}_{l=1}^{N_-},$$
where $y_k \in \{\pm 1\}$ denotes the class label for the feature vector $\mathbf{x}_k \in \mathbb{R}^m$, and $N = N_+ + N_-$ is the total number of instances, comprising $N_+$ positive-class instances and $N_-$ negative-class instances, respectively. The underlying classification problem is imbalanced, which manifests as $N_+ \ll N_-$. The sample $\mathbf{x}_k$ complies with an unknown PDF, with the assumption that instances are generated independently and
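
For reference, with a Gaussian kernel of smoothing parameter $h$, the classical PW estimate of the positive-class PDF takes the standard textbook form below [53], [54]; the snippet above is truncated before the paper's own statement, so this expression is supplied as background rather than quoted from the full text:

```latex
\hat{p}(\mathbf{x}) = \frac{1}{N_+} \sum_{i=1}^{N_+}
  \frac{1}{\left(2\pi h^2\right)^{m/2}}
  \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert^2}{2h^2} \right).
```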

Tunable RBF modelling for classification

After the positive class has been oversampled at the required oversampling rate r, a tunable RBF classifier can then be constructed on the expanded, rebalanced training data set using the algorithm proposed in [19], [20]. For completeness, this PSO-OFS algorithm for constructing the tunable RBF classifier is briefly described here. For notational simplicity, the oversampled two-class training data set is still denoted as $D_N = \{\mathbf{x}_k, y_k\}_{k=1}^{N}$, where the number of the total instances, N, is
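
The snippet is likewise truncated before the model definition; as background, a tunable Gaussian RBF classifier of size $M$ generally takes the form below. This generic form is an assumption based on standard RBF modelling, not a quotation from the full text:

```latex
f(\mathbf{x}) = \sum_{j=1}^{M} w_j
  \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2}{2\sigma_j^2} \right),
\qquad
\hat{y}(\mathbf{x}) = \operatorname{sgn}\big( f(\mathbf{x}) \big),
```

where, in the PSO-OFS procedure, the centre $\boldsymbol{\mu}_j$ and width $\sigma_j$ of each unit are tuned by PSO to minimise the LOO misclassification rate, and the units are added one at a time by the orthogonal forward selection.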

Experimental results

The effectiveness of the PDFOS+PSO-OFS method was examined on the six data sets summarised in Table 1, listed in ascending order of imbalance degree (ID), defined as $\mathrm{ID} = N_-/N_+$. The austempered ductile iron (ADI) material data set came from the study [64], while the other five data sets were from the UCI machine learning repository [65]. Note that the data sets Glass, Satimage and Yeast are multiple-class data sets, which were turned into two-class problems in this study by

Conclusions

Although re-sampling is a straightforward and effective way to deal with imbalanced classification problems, most of the existing methods lack sufficient theoretical insight and justification. This study has followed the over-sampling principle of re-balancing the skewed class distribution, while aiming to maintain the true statistical information manifested in the observed data. This has been achieved by a PW based PDF estimator using the positive data samples,

Acknowledgements

This paper was partly funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, under Grant no. (1-4-1432/HiCi). X. Hong, S. Chen and E. Khalaf acknowledge with thanks DSR technical and financial support.

References (67)

  • G.M. Weiss, F. Provost, The Effect of Class Distribution on Classifier Learning: An Empirical Study, Technical Report...
  • A. Estabrooks et al., A multiple resampling method for learning from imbalanced data sets, Comput. Intell. (2004)
  • N. Japkowicz et al., The class imbalance problem: a systematic study, Intell. Data Anal. (2002)
  • R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced datasets, in: Proceedings of the 15th...
  • G. Wu et al., KBA: kernel boundary alignment considering imbalanced data distribution, IEEE Trans. Knowl. Data Eng. (2005)
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • X. Hong et al., A kernel-based two-class classifier for imbalanced data sets, IEEE Trans. Neural Netw. (2007)
  • J. Moody et al., Fast learning in networks of locally-tuned processing units, Neural Comput. (1989)
  • S. Haykin, Neural Networks: A Comprehensive Foundation (1998)
  • W. Fan, S.J. Stolfo, J. Zhang, P.K. Chan, AdaCost: misclassification cost-sensitive boosting, in: Proceedings of the...
  • J. Kennedy et al., Swarm Intelligence (2001)
  • S. Chen, X. Hong, C.J. Harris, Radial basis function classifier construction using particle swarm optimisation aided...
  • S. Chen et al., Particle swarm optimization aided orthogonal forward regression for unified data modelling, IEEE Trans. Evol. Comput. (2010)
  • A. Ratnaweera et al., Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients, IEEE Trans. Evol. Comput. (2004)
  • W.-F. Leong et al., PSO-based multiobjective optimization with dynamic population size and adaptive local archives, IEEE Trans. Syst. Man Cybern. B (2008)
  • S. Chen et al., Non-linear system identification using particle swarm optimisation tuned radial basis function models, Int. J. Bio-Inspired Comput. (2009)
  • M. Ramezani et al., Determination of capacity benefit margin in multiarea power systems using particle swarm optimization, IEEE Trans. Power Syst. (2009)
  • H.-L. Wei et al., Lattice dynamical wavelet neural networks implemented using particle swarm optimization for spatio-temporal system identification, IEEE Trans. Neural Netw. (2009)
  • S. Chen et al., Particle swarm optimisation aided MIMO transceiver designs
  • P. Puranik et al., Human perception-based color image segmentation using comprehensive learning particle swarm optimization, J. Inf. Hiding Multim. Signal Process. (2011)
  • G.E.A.P.A. Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Expl. Newsl. (2004)
  • C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in:...
  • D.W. Aha et al., Instance-based learning algorithms, Mach. Learn. (1991)

Ming Gao received his B.Eng. degree from Northwestern Polytechnical University, Shaanxi, PR China, in 2006, the M.Eng. degree from Beihang University, Beijing, PR China, in 2009, and the PhD degree from the University of Reading, Reading, UK, in 2013.

His research interests are in data modelling, machine learning, pattern recognition, and their applications to imbalanced problems.

Xia Hong received her university education at the National University of Defense Technology, PR China (B.Sc., 1984; M.Sc., 1987), and the University of Sheffield, UK (Ph.D., 1998), all in automatic control.

She worked as a research assistant at the Beijing Institute of Systems Engineering, Beijing, China, from 1987 to 1993, and as a research fellow in the Department of Electronics and Computer Science at the University of Southampton from 1997 to 2001. She is currently a Professor at the School of Systems Engineering, University of Reading. She is actively engaged in research into nonlinear systems identification, data modelling, estimation and intelligent control, neural networks, pattern recognition, learning theory and their applications. She has published over 100 research papers and coauthored a research book.

Professor Hong was awarded a Donald Julius Groen Prize by the IMechE in 1999.

Sheng Chen received his B.Eng. degree from the East China Petroleum Institute, Dongying, China, in January 1982, and his Ph.D. degree from the City University, London, in September 1986, both in control engineering. In 2005, he was awarded the higher doctoral degree, Doctor of Sciences (D.Sc.), by the University of Southampton, Southampton, UK.

From 1986 to 1999, he held research and academic appointments at the Universities of Sheffield, Edinburgh and Portsmouth, all in the UK. Since 1999, he has been with Electronics and Computer Science, the University of Southampton, UK, where he currently holds the post of Professor in Intelligent Systems and Signal Processing. Dr. Chen's research interests include adaptive signal processing, wireless communications, modelling and identification of nonlinear systems, neural network and machine learning, intelligent control system design, evolutionary computation methods and optimisation. He has published over 500 research papers.

Dr. Chen is a Fellow of the IEEE and a Fellow of the IET. He is a Distinguished Adjunct Professor at King Abdulaziz University, Jeddah, Saudi Arabia, and an ISI highly cited researcher in engineering (March 2004).

Chris J. Harris received his B.Sc. and M.A. degrees from the University of Leicester and the University of Oxford, UK, respectively, and his PhD degree from the University of Southampton, UK, in 1972. He was awarded the higher doctoral degree, Doctor of Sciences (D.Sc.), by the University of Southampton in 2001.

He is Emeritus Research Professor at the University of Southampton, having previously held senior academic appointments at Imperial College and the Universities of Oxford and Manchester, as well as serving as Deputy Chief Scientist for the UK Government.

Professor Harris was awarded the IEE Senior Achievement Medal for data fusion research and the IEE Faraday Medal for distinguished international research in machine learning. He was elected to the UK Royal Academy of Engineering in 1996. He has co-authored over 450 scientific research papers during a 45-year research career.

Emad F. Khalaf received his B.Eng. and M.Eng. degrees in IT, as a single combined certificate, from Wroclaw University of Technology, Poland, in 1992, and the PhD degree in computer networks from Wroclaw University of Technology, Poland, in 2002.

From 2003 to 2011, he worked as an assistant professor in the Computer Engineering Department, Faculty of Engineering, Philadelphia University, Jordan. Since 2012 he has been an assistant professor in the Electrical and Computer Engineering Department, Faculty of Engineering, King Abdulaziz University, Jeddah, Saudi Arabia. Dr. Khalaf's research interests are in network security and cryptography, and speech classification and recognition.

This work was supported by the UK EPSRC.
