Preprocessing unbalanced data using support vector machine
Highlights
► Support vector machine (SVM) acts as a preprocessor for unbalanced data.
► SVM generates extra data related to the minority class.
► The modified training data is used to train multiple classification techniques.
► The hybrid approach performs well in terms of sensitivity.
Introduction
The class imbalance problem has been recognized in many real-world applications [26] and is an evolving topic of machine learning research. The literature shows that standard machine learning techniques tend to produce suboptimal classification models when classes are imbalanced. The class imbalance problem, where few or very few instances are available for the most important class of the study, arises in many real-world application domains, such as telecommunications [23], detection of oil spills in satellite radar images [32], text classification [42], medical diagnosis [29], intrusion detection [34] and fraud detection [41].
Researchers have long attempted to deal with classification using unbalanced datasets. Methods for handling imbalanced problems include resizing the training set, by oversampling minority class samples [35] or downsizing majority class samples [31], adjusting misclassification costs [11], and recognition-based learning [32]. Detailed reviews [19], [30], [39], [51] have discussed the key issues in learning from unbalanced training data with machine learning techniques. Research studies show that many standard machine learning approaches perform poorly, specifically when dealing with medium and large scale unbalanced datasets [17], [26], [32], [49], [50]. One of the key problems when learning from imbalanced data sets is the lack of data, where the number of samples is small or no sample is available for a particular class [50]. When data are scarce, the estimated decision boundary can lie very far from the true boundary. Japkowicz and Stephen [26] reported that for simple, linearly separable data sets, classifier performance was not susceptible to any amount of imbalance. As the degree of data complexity increased, however, the class imbalance factor started to affect the generalization ability of the classifiers. Most accuracy-driven algorithms are biased toward the prevalent class: they improve overall accuracy by assigning the overlapped region to the majority class, and ignore the minority class or treat it as noise [49].
Since the late 1960s, researchers have worked on strategies to deal with the class imbalance problem. In the earliest stage of this research, the condensed nearest neighbor method of under-sampling was used [22]. Wilson [52] proposed the Edited Nearest Neighbor (ENN) method of under-sampling, in which noisy samples are removed from the majority class in order to under-sample the data. Later, Kubat and Matwin [31] developed a concept of selective under-sampling that keeps the minority samples untouched; they introduced a data cleaning procedure based on Tomek links for under-sampling and removed the borderline majority samples. Based on Wilson's ENN method, the Neighbourhood Cleaning Rule was proposed to discard majority class samples [33]. Later, Chawla et al. [9] proposed SMOTE (Synthetic Minority Over-sampling TEchnique), where synthetic (artificial) samples are generated rather than over-sampling with replacement. Maloof [37] reported that sampling has the same effect as moving the decision threshold or adjusting the cost matrix. Barandela et al. [2] proposed a weighted distance function for the classification phase of k-NN to compensate for the imbalance in the training samples without actually altering the class distribution. The efficiency of SVM in dealing with the class imbalance problem was then analyzed [53]; the authors proposed an SVM with a modified kernel function, which pushed the hyperplane closer to the positive class. Estabrooks et al. [13] concluded that combining different expressions of the resampling approach was an effective solution. On the contrary, some researchers reported no further improvement in the predictive performance of SVM for text classification when it was preceded by strategies such as resampling in the presence of imbalanced training data [44].
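SMOTE's core step, interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a minimal numpy illustration of the general idea in [9], not the original implementation; the parameters `k` and `n_new` are illustrative.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen minority point and one of its k nearest minority neighbours,
    in the spirit of SMOTE (Chawla et al.)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)          # pick base minority points
    nb = nn[base, rng.integers(0, k, size=n_new)]  # pick one neighbour of each
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    # synthetic point lies on the segment between base point and neighbour
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```

Because each synthetic point is a convex combination of two real minority points, the generated samples stay inside the convex hull of the minority class.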
Researchers have also emphasized clustering-based preprocessing methods as an alternative to sampling the data. Batista et al. [3], [4] proposed two hybrid sampling techniques, SMOTE + Tomek links and SMOTE + ENN, for overlapping datasets, to obtain better-defined class clusters among the majority and minority classes. Jo and Japkowicz [27] presented a cluster-based over-sampling approach: majority and minority class samples are clustered first, and the clusters in the majority class are over-sampled to the size of the largest cluster obtained for the majority class data. Han et al. [21] proposed borderline-SMOTE, which identifies minority samples at the borderline and applies SMOTE to them; it is the only technique proposed to over-sample the borderline minority samples. Later, a k-means based under-sampling method and an Agglomerative Hierarchical Clustering based over-sampling method for unbalanced datasets were proposed [10]. Guo and Viktor [18] proposed a boosting method with various over-sampling techniques to deal with hard-to-classify examples and concluded that the boosting approach improved the prediction accuracy of the classifier. Huang et al. [25] presented the Biased Minimax Probability Machine to resolve the imbalance problem.
Researchers then turned to hybrid approaches for unbalanced data, combining over-sampling and under-sampling with different concepts in one approach. Some used a combination of under-sampling and over-sampling [35], measuring a classifier's performance with lift analysis instead of classification accuracy. Various hybrids were proposed: a SMOTE-bootstrap hybrid [36], a hybrid combining machine learning with the unsupervised McCab feature selection method using SVM and the maximum entropy method [12], and a hybrid balancing model using unsupervised clustering and decision tree boosting [6]. Later, Farquad et al. [14], [15] proposed a hybrid rule-extraction-from-SVM approach for handling the class imbalance problem and concluded that the rules extracted using their approach performed very well. Table 1 provides a chronological overview of the balancing approaches proposed by various researchers.
To the best of our knowledge, no prior work has used intelligent methods as preprocessors to balance the data. In this paper we employ SVM as a preprocessor. SVM is one of the best intelligent algorithms for classification and regression. A key property of SVM is that it always yields a globally optimal solution, whereas many other intelligent algorithms can get stuck in local minima. SVM finds the decision boundary between classes without regard to the number of instances available for each class, is suitable for high dimensional problems, and works well even with a small number of observations. Hence, a trained SVM is proposed as a preprocessor in this paper.
The rest of the paper is organized as follows. Section 2 presents a brief overview of the method of SVM and motivation for the proposed approach. Section 3 explains the architecture of the proposed balancing approach. Section 4 presents a description of the dataset and the experimental method used in this research. Results and discussions are presented in Section 5. Section 6 concludes the paper. A brief overview of MLP, LR and RF is provided in Appendix A.
Section snippets
Overview of support vector machine
SVM is a learning procedure based on statistical learning theory [47] and is one of the best machine learning techniques used in data mining [54]. It has been applied in a wide variety of domains, such as prediction of colon cancer [1], gene analysis [20], credit rating analysis [24], financial time-series forecasting [28], financial fraud detection [40], estimating manufacturing yields [43], and users' web browsing behavior [55], among others.
For solving a two-class classification
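For the two-class case with labels y_i ∈ {−1, +1}, the soft-margin linear SVM minimizes (λ/2)‖w‖² plus the average hinge loss max(0, 1 − y_i(w·x_i + b)). The sketch below trains this objective with a Pegasos-style stochastic subgradient method; it illustrates the standard formulation only and is not the solver used in the paper, and the unregularized bias update is a common heuristic rather than part of the original Pegasos derivation.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, rng=None):
    """Minimise (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    by stochastic subgradient descent (Pegasos-style).
    Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            margin = y[i] * (X[i] @ w + b)
            w *= (1 - eta * lam)             # shrinkage from the regulariser
            if margin < 1:                   # point violates the margin
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```

On linearly separable data the learned hyperplane sign(w·x + b) separates the two classes after a few hundred epochs.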
Proposed balancing approach
Most real-world data are imbalanced in the proportion of examples available for each class. Imbalanced class distributions can lead algorithms to learn overly complex models that overfit the data and have little relevance. Despite their otherwise strong performance, computational intelligence techniques are biased towards majority class instances: they learn the majority class well and learn little about, or ignore, the minority class. In this paper we
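A minimal sketch of such an SVM-driven balancing step is given below. It assumes, as the highlights state, that the trained SVM is used to generate extra minority-class data that is appended to the training set; the specific rule here, replicating minority samples that fall inside the SVM margin (|f(x)| < 1), is an illustrative assumption and not necessarily the authors' exact procedure. The array `decision` holds the trained SVM's decision values f(x_i).

```python
import numpy as np

def svm_balance(X, y, decision, minority=1, margin=1.0, rng=None):
    """Illustrative preprocessor: append copies of minority-class samples
    that lie inside the SVM margin (|decision value| < margin) until both
    classes have equal counts.  `decision` is the trained SVM's f(x_i)."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_count = len(y) - len(min_idx)
    # minority samples near the boundary carry the most information
    support = min_idx[np.abs(decision[min_idx]) < margin]
    if len(support) == 0:
        support = min_idx                    # fall back to all minority points
    need = maj_count - len(min_idx)          # samples required to balance
    extra = rng.choice(support, size=need, replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal
```

The balanced pair (X_bal, y_bal) would then be used to train the downstream classifiers (MLP, LR, RF, and so on).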
Dataset
The dataset analyzed in this paper was used in the CoIL 2000 data mining competition [46]. It contains customer data for an insurance company; the target variable is whether or not a customer would buy a caravan insurance policy. For each customer, 86 attributes are provided, including 43 socio-demographic variables derived via the customer's zip code, such as age, customer type, religion, relationship status, education level, children in the family, ownership of the house,
Results and discussion
Identifying potential customers who may buy a caravan insurance policy is the basic aim of this study. The quantities used to measure classifier quality are sensitivity, specificity and accuracy [16]. We place the highest emphasis on sensitivity, as it drives the filtering and identification of the most likely buyers of the caravan insurance policy. Consequently, in this paper, sensitivity is given priority over specificity and accuracy. We define the performance
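These three quantities follow directly from the confusion matrix, as the short sketch below shows; here the positive class stands for the caravan-policy buyers (the minority class).

```python
import numpy as np

def classification_quality(y_true, y_pred, positive=1):
    """Sensitivity, specificity and accuracy from predicted labels.
    Sensitivity (recall on the positive class) is the fraction of actual
    buyers correctly identified, the headline metric in this study."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))  # true positives
    tn = np.sum((y_true != positive) & (y_pred != positive))  # true negatives
    fp = np.sum((y_true != positive) & (y_pred == positive))  # false positives
    fn = np.sum((y_true == positive) & (y_pred != positive))  # false negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy
```

On highly unbalanced data a classifier that predicts only the majority class scores high accuracy but zero sensitivity, which is why sensitivity is prioritized here.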
Conclusion
It is well known that standard machine learning algorithms are biased towards the majority class when dealing with unbalanced data. In this research, the efficiency of SVM in dealing with unbalanced data is analyzed and presented. The CoIL dataset [46], which is highly imbalanced with a 94:6 class distribution, is used for empirical analysis. The proposed methodology follows a two-phase approach. During the first phase the available training data is used to train SVM. Later, the
References (55)
- Strategies for learning in class imbalance problems, Pattern Recognition (2003)
- Learning from imbalanced data in surveillance of nosocomial infection, Artificial Intelligence in Medicine (2006)
- An introduction to ROC analysis, Pattern Recognition Letters (2006)
- Designing an expert system for fraud detection in private telecommunications networks, Expert Systems with Applications (2009)
- Credit rating analysis with support vector machines and neural networks: a market comparative study, Decision Support Systems (2004)
- Financial time series forecasting using support vector machines, Neurocomputing (2003)
- Machine learning for medical diagnosis: history, state of the art and perspective, Artificial Intelligence in Medicine (2001)
- A study in machine learning from imbalanced data for sentence boundary detection in speech, Computer Speech and Language (2006)
- Classification algorithm sensitivity to training data with non representative attribute noise, Decision Support Systems (2009)
- Detection of financial statement fraud and feature selection using data mining techniques, Decision Support Systems (2011)
- Association rules applied to credit card fraud detection, Expert Systems with Applications
- On strategies for imbalanced text classification using SVM: a comparative study, Decision Support Systems
- Learning with rare cases and small disjuncts
- Colon cancer prediction with genetic profiles using intelligent techniques, Bioinformation
- Improving rule induction precision for automated annotation by balancing skewed data sets, Knowledge Exploration in Life Science Informatics
- A study of the behaviour of several methods for balancing machine learning training data, ACM SIGKDD Explorations: Special Issue on Imbalanced Data Sets
- Neural Networks for Pattern Recognition
- Hybrid models using unsupervised clustering for prediction of customer churn, Journal of Organizational Computing and Electronic Commerce
- Understanding 99% of Artificial Neural Networks
- Random forests, Machine Learning
- SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- MetaCost: a general method for making classifiers cost-sensitive
- Classification of highly unbalanced cyp450 data of drugs using cost sensitive machine learning techniques, Journal of Chemical Information and Modeling
- A multiple resampling method for learning from imbalanced data sets, Computational Intelligence
- Data mining using rules extracted from SVM: an application to churn prediction in bank credit cards
- Rule extraction from Support Vector Machine using modified active learning based approach: an application to CRM
- Adaptive fraud detection, Data Mining and Knowledge Discovery
Mohammed Abdul Haque Farquad is a Research Assistant at the School of Business, The University of Hong Kong. He holds a Ph.D. in Computer Science from University of Hyderabad, Hyderabad, India. His research interests include data mining, soft computing, banking, finance, and customer relationship management. His research work has been published in Expert Systems with Applications, International Journal of Information and Decision Sciences, and in various Proceedings of International Conferences published by IEEE and Springer. He is an ad-hoc referee for Information Sciences Journal, Knowledge Based System Journal and various IEEE International Conferences. He is a Program Committee member of International Conference on Data Mining 2011, Las Vegas and also a Technical committee member of the 3rd International Conference on Computer Technology and Development, China.
Indranil Bose is Full Professor at the Indian Institute of Management Calcutta. He holds a B. Tech. from the Indian Institute of Technology, MS from the University of Iowa, and MS and Ph.D. from Purdue University. His research interests are in telecommunications, data mining, information security, and supply chain management. His publications have appeared in Communications of the ACM, Communications of AIS, Computers and Operations Research, Decision Support Systems, Ergonomics, European Journal of Operational Research, Information & Management, Journal of Organizational Computing and Electronic Commerce, Journal of the American Society for Information Science and Technology, Operations Research Letters, etc. He is listed in the International Who's Who of Professionals 2005–2006, Marquis Who's Who in the World 2006, Marquis Who's Who in Asia 2007, Marquis Who's Who in Science and Engineering 2007, and Marquis Who's Who of Emerging Leaders 2007. He serves on the editorial board of Information & Management, Communications of AIS, and several other IS journals.