Preprocessing unbalanced data using support vector machine

https://doi.org/10.1016/j.dss.2012.01.016

Abstract

This paper applies the support vector machine (SVM) to the class imbalance problem. The objective is to examine the feasibility and efficiency of SVM as a preprocessor. Our study analyzes different classification algorithms employed to predict which customers hold a caravan insurance policy based on their socio-demographic data and history of product ownership. A series of experiments was conducted to test various computational intelligence techniques, viz., Multilayer Perceptron (MLP), Logistic Regression (LR), and Random Forest (RF). Various standard balancing techniques such as under-sampling, over-sampling and the Synthetic Minority Over-sampling TEchnique (SMOTE) are also employed. Subsequently, a data balancing strategy for handling imbalanced distributions is proposed. The proposed approach first employs SVM as a preprocessor; the actual target values of the training data are then replaced by the predictions of the trained SVM. This modified training data is then used to train techniques such as MLP, LR, and RF. Based on the measure of sensitivity, it is observed that the proposed approach not only balances the data effectively but also provides more instances of the minority class, which in turn enhances the performance of the intelligence techniques.

Highlights

► Support vector machine (SVM) acts as a preprocessor for unbalanced data.
► SVM generates extra data related to the minority class.
► The modified training data is used to train multiple classification techniques.
► The hybrid approach performs well in terms of sensitivity.

Introduction

The class imbalance problem has been recognized in many real world applications [26] and is an evolving topic of machine learning research. It is observed from the literature that standard machine learning techniques tend to produce suboptimal classification models on such data. The class imbalance problem, where few or very few instances are available for the most important class of the study, exists in many real world application domains, such as telecommunications [23], detection of oil spills in satellite radar images [32], text classification [42], medical diagnosis [29], intrusion detection [34] and fraud detection [41].

Researchers have been attempting to deal with classification using unbalanced datasets. Methods to deal with imbalanced problems include resizing the training set (over-sampling minority class samples [35] and downsizing majority class samples [31]), adjusting misclassification costs [11], and recognition based learning [32]. Detailed review reports [19], [30], [39], [51] have discussed the key issues related to problem solving with unbalanced training data using machine learning techniques. Research studies show that many standard machine learning approaches result in poor performance, specifically when dealing with medium and large scale unbalanced datasets [17], [26], [32], [49], [50]. One of the key problems when learning with imbalanced data sets is the lack of data, where the number of samples for a particular class is small or no sample is available at all [50]. If there is a lack of data, the estimated decision boundary can be very far from the true boundary. Japkowicz and Stephen [26] reported that for simple data sets that were linearly separable, classifier performance was not susceptible to any amount of imbalance. However, as the degree of data complexity increased, the class imbalance factor started affecting the generalization ability of the classifiers. Most accuracy-driven algorithms are biased toward the prevalent class: they improve overall accuracy by assigning the overlapped region to the majority class, ignoring the minority class or treating it as noise [49].
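The resizing strategies mentioned above, random under-sampling of the majority class and random over-sampling of the minority class, can be sketched as follows. This is an illustrative sketch; the function and its signature are ours, not taken from any cited work.

```python
import numpy as np

def random_resample(X, y, minority=1, strategy="under", rng=None):
    """Balance a binary dataset by random under- or over-sampling.

    Under-sampling discards majority examples until the classes are even;
    over-sampling duplicates minority examples until they match the majority.
    """
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    if strategy == "under":
        # keep only as many majority samples as there are minority samples
        maj_idx = rng.choice(maj_idx, size=len(min_idx), replace=False)
    else:  # "over": duplicate minority samples with replacement
        min_idx = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([min_idx, maj_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Under-sampling risks discarding informative majority examples, while over-sampling by duplication adds no new information, which is what motivates the synthetic approaches discussed next.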

Since the late 1960s, researchers have put their efforts toward developing strategies to deal with the class imbalance problem. In the earliest stage of this research, researchers used the condensed nearest neighbor method of under-sampling [22]. Wilson [52] proposed an Edited Nearest Neighbor (ENN) method of under-sampling, in which noisy samples from the majority class are removed in order to under-sample the data. Later, Kubat and Matwin [31] developed a concept of selective under-sampling that keeps the minority samples untouched; they introduced a data cleaning procedure using the Tomek–Links concept for under-sampling and removed the borderline majority samples. Based on Wilson's ENN method, the Neighbourhood Cleaning Rule was proposed to discard majority class samples [33]. Later, Chawla et al. [9] proposed SMOTE (Synthetic Minority Over-sampling TEchnique), in which synthetic (artificial) minority samples are generated rather than over-sampling by replacement. Maloof [37] reported that sampling has the same effect as moving the decision threshold or adjusting the cost matrix. Barandela et al. [2] proposed a weighted distance function to be used in the classification phase of k-NN to compensate for the imbalance in the training samples without actually altering the class distribution. The efficiency of SVM in dealing with the class imbalance problem was then analyzed [53]; the authors proposed an SVM with a modified kernel function, which pushed the hyperplane closer to the positive class. Estabrooks et al. [13] concluded that combining different expressions of the resampling approach was an effective solution. On the contrary, some researchers reported that resampling strategies brought no further improvement to the predictive performance of SVM for text classification with imbalanced training data [44].
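SMOTE [9], referenced above, generates a synthetic sample by interpolating between a minority example and one of its k nearest minority-class neighbours, rather than duplicating existing examples. A minimal sketch (our illustration, not Chawla et al.'s original code):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples in the style of SMOTE.

    Each synthetic point lies on the line segment between a minority
    sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbours
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                       # a minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]    # one neighbour
        gap = rng.random()                                 # interpolation factor
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Because each synthetic point is a convex combination of two real minority points, the new samples fall inside the region already occupied by the minority class rather than repeating existing observations.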

Researchers have also emphasized the use of clustering based preprocessing methods as an alternative to sampling the data. Batista et al. [3], [4] proposed two hybrid sampling techniques, SMOTE + Tomek–Links and SMOTE + ENN, for overlapping datasets, producing better defined class clusters for the majority and minority classes. Jo and Japkowicz [27] presented a cluster based over-sampling approach: majority and minority class samples are clustered first, and the clusters in the majority class are over-sampled to the size of the largest majority class cluster. Han et al. [21] proposed borderline SMOTE, which identifies minority samples at the borderline and applies SMOTE to them; it is the only technique proposed to over-sample the borderline minority samples. Later, a k-means based under-sampling method and an Agglomerative Hierarchical Clustering based over-sampling method were proposed to deal with unbalanced datasets [10]. Guo and Viktor [18] proposed a boosting method with various over-sampling techniques to deal with hard to classify examples and concluded that the boosting approach improved the prediction accuracy of the classifier. Huang et al. [25] presented the Biased Minimax Probability Machine to resolve the imbalance problem.

Researchers then exerted their efforts toward developing hybrid approaches to deal with unbalanced data, combining over-sampling and under-sampling with different concepts in one approach. Some used a combination of under-sampling and over-sampling [35], using lift analysis instead of classification accuracy to measure a classifier's performance. Various hybrids have been proposed: a SMOTE-bootstrap hybrid [36], a hybrid combining machine learning and an unsupervised McCab feature selection method using SVM and the maximum entropy method [12], and a hybrid balancing model using unsupervised clustering and decision tree boosting [6]. Later, Farquad et al. [14], [15] proposed a hybrid rule extraction from SVM approach for handling the class imbalance problem, and concluded that rules extracted using their approach performed very well. Table 1 provides a chronological overview of the balancing approaches proposed by various researchers.

To the best of our knowledge, no prior study has employed an intelligent method as a preprocessor to balance the data. In this paper we employ SVM as a preprocessor. SVM is one of the best intelligent algorithms used for classification and regression. A key property of SVM is that its training problem is convex and therefore always yields a globally optimal solution, whereas many other intelligent algorithms can get stuck in local minima. SVM tries to find the decision boundary between classes without regard to the number of instances available for each class. It is suitable for high dimensional problems and works with a small number of observations as well. Hence, the trained SVM is proposed as a preprocessor in this paper.

The rest of the paper is organized as follows. Section 2 presents a brief overview of the method of SVM and motivation for the proposed approach. Section 3 explains the architecture of the proposed balancing approach. Section 4 presents a description of the dataset and the experimental method used in this research. Results and discussions are presented in Section 5. Section 6 concludes the paper. A brief overview of MLP, LR and RF is provided in Appendix A.

Section snippets

Overview of support vector machine

SVM is a learning procedure based on statistical learning theory [47], and it is one of the best machine learning techniques used in data mining [54]. It has been used in a wide variety of applications such as prediction of colon cancer [1], gene analysis [20], credit rating analysis [24], financial time-series forecasting [28], financial fraud detection [40], estimating manufacturing yields [43], and users' web browsing behavior [55], among others.

For solving a two-class classification
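
The snippet above is truncated at the source. For reference, the standard two-class soft-margin optimization problem that SVM solves (following Vapnik [47]) is

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \; i = 1, \dots, n,
```

where $\phi$ maps inputs into a feature space and the trade-off parameter $C$ penalizes the margin violations $\xi_i$. Because this problem is convex, its solution is globally optimal, which underpins the preprocessing role SVM plays in this paper.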

Proposed balancing approach

Most real-world data are imbalanced in terms of the proportion of examples available for each class. This problem of imbalanced class distributions can lead algorithms to learn overly complex models that overfit the data and have little relevance. It is observed that despite the better performance of computational intelligence techniques, they are biased towards majority class instances: they learn the majority class well while learning little about, or ignoring, the minority class. In this paper we
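
The two-phase idea summarized in the abstract (train an SVM on the original imbalanced data, replace the training targets with the SVM's predictions, then train the final classifier on the relabelled data) can be sketched as follows. This is an illustrative sketch using scikit-learn; the function name and default model choices are ours, not the authors' original code.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def svm_preprocess_then_train(X_train, y_train, svm=None, clf=None):
    """Phase 1: fit an SVM and relabel the training targets with its
    predictions.  Phase 2: fit the final classifier on the relabelled data."""
    svm = svm or SVC(kernel="rbf")
    clf = clf or RandomForestClassifier(random_state=0)
    svm.fit(X_train, y_train)
    y_relabelled = svm.predict(X_train)   # modified (balanced) targets
    clf.fit(X_train, y_relabelled)
    return clf
```

Any downstream learner (MLP, LR, or RF in the paper's experiments) can be passed in as `clf`; the SVM's decision boundary, rather than the raw class counts, determines the labels it is trained on.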

Dataset

The dataset analyzed in this paper was used in the CoIL 2000 data mining competition [46]. It contains customer data from an insurance company. The target variable is whether or not a customer would buy a caravan insurance policy. For each customer, 86 attributes are provided. They include 43 socio-demographic variables derived via the customer's zip code, covering age, customer type, religion, relationship status, education level, children in the family, ownership of the house,

Results and discussion

Identifying the potential customers who may buy a caravan insurance policy is the basic intention of this study. The quantities employed to measure the quality of the classifiers are sensitivity, specificity and accuracy [16]. We place the highest emphasis on sensitivity, which reflects how well the classifier finds the most likely buyers of the caravan insurance policy. Consequently, in this paper, sensitivity is given top priority ahead of specificity and accuracy. We define the performance
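
The three quantities follow directly from the binary confusion matrix. A minimal sketch (our illustration, with the positive class taken to be the policy buyers):

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and accuracy from binary predictions.

    Sensitivity = TP / (TP + FN): fraction of actual buyers identified.
    Specificity = TN / (TN + FP): fraction of non-buyers identified.
    Accuracy    = (TP + TN) / total.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy
```

On a 94:6 dataset a classifier that predicts "non-buyer" for everyone scores 94% accuracy but 0% sensitivity, which is why sensitivity is the primary criterion here.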

Conclusion

It is well known that standard machine learning algorithms are biased towards the majority class when dealing with unbalanced data. In this research, the efficiency of SVM in dealing with unbalanced data is analyzed and presented. The CoIL dataset [46], which is highly imbalanced with a 94:6 class distribution, is used for the empirical analysis. The proposed methodology follows a two-phase approach. During the first phase the available training data is used to train SVM. Later, the

Mohammed Abdul Haque Farquad is a Research Assistant at the School of Business, The University of Hong Kong. He holds a Ph.D. in Computer Science from University of Hyderabad, Hyderabad, India. His research interests include data mining, soft computing, banking, finance, and customer relationship management. His research work has been published in Expert Systems with Applications, International Journal of Information and Decision Sciences, and in various Proceedings of International Conferences published by IEEE and Springer. He is an ad-hoc referee for Information Sciences Journal, Knowledge Based System Journal and various IEEE International Conferences. He is a Program Committee member of International Conference on Data Mining 2011, Las Vegas and also a Technical committee member of the 3rd International Conference on Computer Technology and Development, China.

References (55)

  • D. Sanchez et al., Association rules applied to credit card fraud detection, Expert Systems with Applications (2009)
  • A. Sun et al., On strategies for imbalanced text classification using SVM: a comparative study, Decision Support Systems (2009)
  • G.M. Weiss, Learning with rare cases and small disjuncts
  • S.M. Alladi et al., Colon cancer prediction with genetic profiles using intelligent techniques, Bioinformation (2008)
  • G.E.A.P.A. Batista et al., Improving rule induction precision for automated annotation by balancing skewed data sets, Knowledge Exploration in Life Science Informatics (2004)
  • G.E.A.P.A. Batista et al., A study of the behaviour of several methods for balancing machine learning training data, ACM SIGKDD Explorations: Special Issue on Imbalanced Data Sets (2004)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • I. Bose et al., Hybrid models using unsupervised clustering for prediction of customer churn, Journal of Organizational Computing and Electronic Commerce (2009)
  • M. Bosque, Understanding 99% of Artificial Neural Networks (2002)
  • L. Breiman, Random forests, Machine Learning (2001)
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • P. Domingos, MetaCost: a general method for making classifiers cost-sensitive
  • T. Eitrich et al., Classification of highly unbalanced cyp450 data of drugs using cost sensitive machine learning techniques, Journal of Chemical Information and Modeling (2007)
  • A. Estabrooks et al., A multiple resampling method for learning from imbalanced data sets, Computational Intelligence (2004)
  • M.A.H. Farquad et al., Data mining using rules extracted from SVM: an application to churn prediction in bank credit cards
  • M.A.H. Farquad et al., Rule extraction from Support Vector Machine using modified active learning based approach: an application to CRM
  • T. Fawcett et al., Adaptive fraud detection, Data Mining and Knowledge Discovery (1997)


Indranil Bose is Full Professor at the Indian Institute of Management Calcutta. He holds a B. Tech. from the Indian Institute of Technology, MS from the University of Iowa, and MS and Ph.D. from Purdue University. His research interests are in telecommunications, data mining, information security, and supply chain management. His publications have appeared in Communications of the ACM, Communications of AIS, Computers and Operations Research, Decision Support Systems, Ergonomics, European Journal of Operational Research, Information & Management, Journal of Organizational Computing and Electronic Commerce, Journal of the American Society for Information Science and Technology, Operations Research Letters, etc. He is listed in the International Who's Who of Professionals 2005–2006, Marquis Who's Who in the World 2006, Marquis Who's Who in Asia 2007, Marquis Who's Who in Science and Engineering 2007, and Marquis Who's Who of Emerging Leaders 2007. He serves on the editorial board of Information & Management, Communications of AIS, and several other IS journals.
