Preprocessing unbalanced data using support vector machine

https://doi.org/10.1016/j.dss.2012.01.016

Abstract

This paper applies the support vector machine (SVM) to the class imbalance problem. The objective is to examine the feasibility and efficiency of SVM as a preprocessor. Our study analyzes different classification algorithms employed to predict which customers hold a caravan insurance policy based on their socio-demographic data and history of product ownership. A series of experiments was conducted to test various computational intelligence techniques, viz., Multilayer Perceptron (MLP), Logistic Regression (LR), and Random Forest (RF). Various standard balancing techniques such as under-sampling, over-sampling and the Synthetic Minority Over-sampling TEchnique (SMOTE) are also employed. Subsequently, a data balancing strategy for handling imbalanced distributions is proposed. The proposed approach first employs SVM as a preprocessor; the actual target values of the training data are then replaced by the predictions of the trained SVM. This modified training data is then used to train techniques such as MLP, LR, and RF. Based on the measure of sensitivity, it is observed that the proposed approach not only balances the data effectively but also provides more instances of the minority class, which in turn enhances the performance of the intelligence techniques.

Highlights

► Support vector machine (SVM) acts as a preprocessor for unbalanced data.
► SVM generates extra data related to the minority class.
► The modified training data is used to train multiple classification techniques.
► The hybrid approach performs well in terms of sensitivity.

Introduction

The class imbalance problem has been recognized in many real world applications [26] and is an evolving topic of machine learning research. It is observed from the literature that standard machine learning techniques tend to produce suboptimal classification models on such data. The class imbalance problem, where few or very few instances are available for the most important class of the study, exists in many real world application domains, such as telecommunications [23], detection of oil spills in satellite radar images [32], text classification [42], medical diagnosis [29], intrusion detection [34] and fraud detection [41].

Researchers have been attempting to deal with classification using unbalanced datasets. Methods to deal with imbalanced problems include resizing the training set (over-sampling minority class samples [35] and downsizing majority class samples [31]), adjusting misclassification costs [11], and recognition based learning [32]. Detailed review reports [19], [30], [39], [51] have discussed the key issues related to problem solving with unbalanced training data using machine learning techniques. Research studies show that many standard machine learning approaches result in poor performance, specifically when dealing with medium and large scale unbalanced datasets [17], [26], [32], [49], [50]. One of the key problems when learning with imbalanced data sets is the lack of data, where the number of samples for a particular class is small or no sample is available at all [50]. If there is a lack of data, the estimated decision boundary can be very far from the true boundary. Japkowicz and Stephen [26] reported that for simple data sets that were linearly separable, classifier performance was not susceptible to any amount of imbalance. However, as the degree of data complexity increased, the class imbalance factor started affecting the generalization ability of the classifiers. Most accuracy-driven algorithms are biased toward the prevalent class: they improve overall accuracy by assigning the overlapped region to the majority class, ignoring the minority class or treating it as noise [49].
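The resizing strategies mentioned above, random under-sampling of the majority class and random over-sampling of the minority class, can be sketched as follows. This is an illustrative sketch; the function and its signature are ours, not taken from any cited work.

```python
import numpy as np

def random_resample(X, y, minority=1, strategy="under", rng=None):
    """Balance a binary dataset by random under- or over-sampling.

    Under-sampling discards majority examples until the classes are even;
    over-sampling duplicates minority examples until they match the majority.
    """
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    if strategy == "under":
        # keep only as many majority samples as there are minority samples
        maj_idx = rng.choice(maj_idx, size=len(min_idx), replace=False)
    else:  # "over": duplicate minority samples with replacement
        min_idx = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([min_idx, maj_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Under-sampling risks discarding informative majority examples, while over-sampling by duplication adds no new information, which is what motivates the synthetic approaches discussed next.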

Since the late 1960s, researchers have put their efforts toward developing strategies to deal with the class imbalance problem. In the earliest stage of this research, researchers used the condensed nearest neighbor method of under-sampling [22]. Wilson [52] proposed an Edited Nearest Neighbor (ENN) method of under-sampling, in which noisy samples from the majority class are removed in order to under-sample the data. Later, Kubat and Matwin [31] developed a concept of selective under-sampling that keeps the minority samples untouched; they introduced a data cleaning procedure using the Tomek–Links concept for under-sampling and removed the borderline majority samples. Based on Wilson's ENN method, the Neighbourhood Cleaning Rule was proposed to discard majority class samples [33]. Later, Chawla et al. [9] proposed SMOTE (Synthetic Minority Over-sampling TEchnique), in which synthetic (artificial) minority samples are generated rather than over-sampling by replacement. Maloof [37] reported that sampling has the same effect as moving the decision threshold or adjusting the cost matrix. Barandela et al. [2] proposed a weighted distance function to be used in the classification phase of k-NN to compensate for the imbalance in the training samples without actually altering the class distribution. The efficiency of SVM in dealing with the class imbalance problem was then analyzed [53]; the authors proposed an SVM with a modified kernel function, which pushed the hyperplane closer to the positive class. Estabrooks et al. [13] concluded that combining different expressions of the resampling approach was an effective solution. On the contrary, some researchers reported that resampling strategies brought no further improvement to the predictive performance of SVM for text classification with imbalanced training data [44].
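SMOTE [9], referenced above, generates a synthetic sample by interpolating between a minority example and one of its k nearest minority-class neighbours, rather than duplicating existing examples. A minimal sketch (our illustration, not Chawla et al.'s original code):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples in the style of SMOTE.

    Each synthetic point lies on the line segment between a minority
    sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbours
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                       # a minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]    # one neighbour
        gap = rng.random()                                 # interpolation factor
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Because each synthetic point is a convex combination of two real minority points, the new samples fall inside the region already occupied by the minority class rather than repeating existing observations.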

Researchers have also emphasized the use of clustering based preprocessing methods as an alternative to sampling the data. Batista et al. [3], [4] proposed two hybrid sampling techniques, SMOTE + Tomek–Links and SMOTE + ENN, for overlapping datasets, producing better defined class clusters for the majority and minority classes. Jo and Japkowicz [27] presented a cluster based over-sampling approach: majority and minority class samples are clustered first, and the clusters in the majority class are over-sampled to the size of the largest majority class cluster. Han et al. [21] proposed borderline SMOTE, which identifies minority samples at the borderline and applies SMOTE to them; it is the only technique proposed to over-sample the borderline minority samples. Later, a k-means based under-sampling method and an Agglomerative Hierarchical Clustering based over-sampling method were proposed to deal with unbalanced datasets [10]. Guo and Viktor [18] proposed a boosting method with various over-sampling techniques to deal with hard to classify examples and concluded that the boosting approach improved the prediction accuracy of the classifier. Huang et al. [25] presented the Biased Minimax Probability Machine to resolve the imbalance problem.

Researchers then exerted their efforts toward developing hybrid approaches to deal with unbalanced data, combining over-sampling and under-sampling with different concepts in one approach. Some used a combination of under-sampling and over-sampling [35], using lift analysis instead of classification accuracy to measure a classifier's performance. Various hybrids have been proposed: a SMOTE-bootstrap hybrid [36], a hybrid combining machine learning and an unsupervised McCab feature selection method using SVM and the maximum entropy method [12], and a hybrid balancing model using unsupervised clustering and decision tree boosting [6]. Later, Farquad et al. [14], [15] proposed a hybrid rule extraction from SVM approach for handling the class imbalance problem, and concluded that rules extracted using their approach performed very well. Table 1 provides a chronological overview of the balancing approaches proposed by various researchers.

To the best of our knowledge, no prior study has employed an intelligent method as a preprocessor to balance the data. In this paper we employ SVM as a preprocessor. SVM is one of the best intelligent algorithms used for classification and regression. A key property of SVM is that its training problem is convex and therefore always yields a globally optimal solution, whereas many other intelligent algorithms can get stuck in local minima. SVM tries to find the decision boundary between classes without regard to the number of instances available for each class. It is suitable for high dimensional problems and works with a small number of observations as well. Hence, the trained SVM is proposed as a preprocessor in this paper.

The rest of the paper is organized as follows. Section 2 presents a brief overview of the method of SVM and motivation for the proposed approach. Section 3 explains the architecture of the proposed balancing approach. Section 4 presents a description of the dataset and the experimental method used in this research. Results and discussions are presented in Section 5. Section 6 concludes the paper. A brief overview of MLP, LR and RF is provided in Appendix A.

Section snippets

Overview of support vector machine

SVM is a learning procedure based on statistical learning theory [47], and it is one of the best machine learning techniques used in data mining [54]. It has been used in a wide variety of applications such as prediction of colon cancer [1], gene analysis [20], credit rating analysis [24], financial time-series forecasting [28], financial fraud detection [40], estimating manufacturing yields [43], and users' web browsing behavior [55], among others.

For solving a two-class classification
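
The snippet above is truncated at the source. For reference, the standard two-class soft-margin optimization problem that SVM solves (following Vapnik [47]) is

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \; i = 1, \dots, n,
```

where $\phi$ maps inputs into a feature space and the trade-off parameter $C$ penalizes the margin violations $\xi_i$. Because this problem is convex, its solution is globally optimal, which underpins the preprocessing role SVM plays in this paper.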

Proposed balancing approach

Most real-world data are imbalanced in terms of the proportion of examples available for each class. This problem of imbalanced class distributions can lead algorithms to learn overly complex models that overfit the data and have little relevance. It is observed that despite the better performance of computational intelligence techniques, they are biased towards majority class instances: they learn the majority class well while learning little about, or ignoring, the minority class. In this paper we
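
The two-phase idea summarized in the abstract (train an SVM on the original imbalanced data, replace the training targets with the SVM's predictions, then train the final classifier on the relabelled data) can be sketched as follows. This is an illustrative sketch using scikit-learn; the function name and default model choices are ours, not the authors' original code.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def svm_preprocess_then_train(X_train, y_train, svm=None, clf=None):
    """Phase 1: fit an SVM and relabel the training targets with its
    predictions.  Phase 2: fit the final classifier on the relabelled data."""
    svm = svm or SVC(kernel="rbf")
    clf = clf or RandomForestClassifier(random_state=0)
    svm.fit(X_train, y_train)
    y_relabelled = svm.predict(X_train)   # modified (balanced) targets
    clf.fit(X_train, y_relabelled)
    return clf
```

Any downstream learner (MLP, LR, or RF in the paper's experiments) can be passed in as `clf`; the SVM's decision boundary, rather than the raw class counts, determines the labels it is trained on.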

Dataset

The dataset analyzed in this paper was used in the CoIL 2000 data mining competition [46]. It contains customer data from an insurance company. The target variable is whether or not a customer would buy a caravan insurance policy. For each customer, 86 attributes are provided. They include 43 socio-demographic variables derived via the customer's zip code, covering age, customer type, religion, relationship status, education level, children in the family, ownership of the house,

Results and discussion

Identifying the potential customers who may buy a caravan insurance policy is the basic intention of this study. The quantities employed to measure the quality of the classifiers are sensitivity, specificity and accuracy [16]. We place the highest emphasis on sensitivity, which reflects how well the classifier finds the most likely buyers of the caravan insurance policy. Consequently, in this paper, sensitivity is given top priority ahead of specificity and accuracy. We define the performance
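
The three quantities follow directly from the binary confusion matrix. A minimal sketch (our illustration, with the positive class taken to be the policy buyers):

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and accuracy from binary predictions.

    Sensitivity = TP / (TP + FN): fraction of actual buyers identified.
    Specificity = TN / (TN + FP): fraction of non-buyers identified.
    Accuracy    = (TP + TN) / total.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy
```

On a 94:6 dataset a classifier that predicts "non-buyer" for everyone scores 94% accuracy but 0% sensitivity, which is why sensitivity is the primary criterion here.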

Conclusion

It is well known that standard machine learning algorithms are biased towards the majority class when dealing with unbalanced data. In this research, the efficiency of SVM in dealing with unbalanced data is analyzed and presented. The CoIL dataset [46], which is highly imbalanced with a 94:6 class distribution, is used for the empirical analysis. The proposed methodology follows a two-phase approach. During the first phase the available training data is used to train SVM. Later, the

Mohammed Abdul Haque Farquad is a Research Assistant at the School of Business, The University of Hong Kong. He holds a Ph.D. in Computer Science from University of Hyderabad, Hyderabad, India. His research interests include data mining, soft computing, banking, finance, and customer relationship management. His research work has been published in Expert Systems with Applications, International Journal of Information and Decision Sciences, and in various Proceedings of International Conferences published by IEEE and Springer. He is an ad-hoc referee for Information Sciences Journal, Knowledge Based System Journal and various IEEE International Conferences. He is a Program Committee member of International Conference on Data Mining 2011, Las Vegas and also a Technical committee member of the 3rd International Conference on Computer Technology and Development, China.

References (55)

  • D. Sanchez et al., Association rules applied to credit card fraud detection, Expert Systems with Applications (2009)
  • A. Sun et al., On strategies for imbalanced text classification using SVM: a comparative study, Decision Support Systems (2009)
  • G.M. Weiss, Learning with rare cases and small disjuncts
  • S.M. Alladi et al., Colon cancer prediction with genetic profiles using intelligent techniques, Bioinformation (2008)
  • G.E.A.P.A. Batista et al., Improving rule induction precision for automated annotation by balancing skewed data sets, Knowledge Exploration in Life Science Informatics (2004)
  • G.E.A.P.A. Batista et al., A study of the behaviour of several methods for balancing machine learning training data, ACM SIGKDD Explorations: Special Issue on Imbalanced Data Sets (2004)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • I. Bose et al., Hybrid models using unsupervised clustering for prediction of customer churn, Journal of Organizational Computing and Electronic Commerce (2009)
  • M. Bosque, Understanding 99% of Artificial Neural Networks (2002)
  • L. Breiman, Random forests, Machine Learning (2001)
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • P. Domingos, MetaCost: a general method for making classifiers cost-sensitive
  • T. Eitrich et al., Classification of highly unbalanced cyp450 data of drugs using cost sensitive machine learning techniques, Journal of Chemical Information and Modeling (2007)
  • A. Estabrooks et al., A multiple resampling method for learning from imbalanced data sets, Computational Intelligence (2004)
  • M.A.H. Farquad et al., Data mining using rules extracted from SVM: an application to churn prediction in bank credit cards
  • M.A.H. Farquad et al., Rule extraction from Support Vector Machine using modified active learning based approach: an application to CRM
  • T. Fawcett et al., Adaptive fraud detection, Data Mining and Knowledge Discovery (1997)


Indranil Bose is Full Professor at the Indian Institute of Management Calcutta. He holds a B. Tech. from the Indian Institute of Technology, MS from the University of Iowa, and MS and Ph.D. from Purdue University. His research interests are in telecommunications, data mining, information security, and supply chain management. His publications have appeared in Communications of the ACM, Communications of AIS, Computers and Operations Research, Decision Support Systems, Ergonomics, European Journal of Operational Research, Information & Management, Journal of Organizational Computing and Electronic Commerce, Journal of the American Society for Information Science and Technology, Operations Research Letters, etc. He is listed in the International Who's Who of Professionals 2005–2006, Marquis Who's Who in the World 2006, Marquis Who's Who in Asia 2007, Marquis Who's Who in Science and Engineering 2007, and Marquis Who's Who of Emerging Leaders 2007. He serves on the editorial board of Information & Management, Communications of AIS, and several other IS journals.
