Under-sampling class imbalanced datasets by combining clustering analysis and instance selection
Introduction
In the current era of big data, data mining and analysis are becoming increasingly important for making effective decisions. Among the various data mining techniques, classification analysis is one of the most widely used for various business and engineering problems, such as bankruptcy prediction [21], cancer prediction [18], churn prediction [23], face detection [35], fraud detection [31], and software fault prediction [24].
In general, classifiers (or prediction models) usually perform well on data that are evenly distributed across classes. In practice, however, the data collected for training the classifiers are usually class imbalanced, i.e., the numbers of data samples in different classes differ greatly. For example, in a two-class dataset, the two classes may contain 10 and 1000 data samples, respectively. In particular, the distribution of data in the feature space is usually skewed in class-imbalanced datasets [8]. Furthermore, datasets with a skewed distribution will usually have other problematic characteristics, such as data sample overlap, small sample sizes, and small disjuncts [4], [13].
The characteristics of class-imbalanced datasets mentioned above violate the assumption of a relatively balanced data distribution made by most classification algorithms. As a result, it is very difficult for classifiers to correctly predict the small (or minority) class, and they are likely to misclassify testing samples into the prevalent (or majority) class [8], [28]. However, in many real-world problems, e.g., credit card fraud detection (non-fraud vs. fraud cases), bankruptcy prediction (non-bankrupt vs. bankrupt cases), and disease detection (non-infected vs. infected cases), accurate detection, prediction, or classification of the data in the minority class is critical.
In the literature, three types of approaches have been used to tackle the class imbalance problem: data-level [1], [19], algorithm-level [34], [36], and cost-sensitive [8], [22] methods. Among these, the data-level methods are the most widely used for class-imbalanced datasets [13].
The data-level methods aim at reducing the imbalance ratio between the majority and minority classes by either undersampling the data in the majority class [19], [33] or oversampling the data in the minority class [1], [7]. As dataset sizes continue to grow, undersampling is often the better choice, since it shrinks the training set rather than enlarging it.
To reduce the number of data samples in the majority class, cluster-based sampling methods were introduced, which can outperform the random sampling approach [20], [29], [33]. In general, cluster-based sampling methods first group the majority class dataset into a number of clusters, after which a number of representative data samples are selected from each cluster. However, there are several limitations of the cluster-based sampling methods that directly affect the reduced majority class dataset and the final classification performance. For instance, it is difficult to decide on the number of clusters needed for the optimal clustering result. In addition, the representative data samples to be selected from each cluster need to be carefully defined; otherwise, the original data distribution of the majority class may be changed.
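As a concrete illustration of the cluster-based sampling idea described above, the following sketch groups a majority class with a plain k-means and keeps only the sample nearest to each centroid. The function names, the choice of k-means, and the centroid-nearest selection rule are illustrative assumptions, not the method of any particular cited study.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """A plain k-means: returns (centroids, cluster assignment per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, assign

def cluster_undersample(majority, k):
    """Keep only the sample closest to each cluster centroid."""
    centroids, assign = kmeans(majority, k)
    reduced = []
    for c in range(k):
        members = [p for p, a in zip(majority, assign) if a == c]
        if members:
            reduced.append(min(members, key=lambda p: dist2(p, centroids[c])))
    return reduced

majority = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(cluster_undersample(majority, 2))  # one representative per cluster
```

Note how the selection rule directly determines the reduced dataset: keeping only centroid-nearest samples discards boundary points, which is exactly the kind of distribution change the limitation above refers to.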
In related studies of data preprocessing, instance selection is used to filter out unrepresentative data samples (or outliers) from a given training dataset, which can make the resulting classifiers outperform those trained on the original training dataset without instance selection [14]. In principle, this method could be used to reduce the dataset size of the majority class. However, since existing instance selection algorithms are designed to distinguish good from noisy data samples in multiclass datasets, they cannot be applied to a dataset containing only one class, i.e., the majority class dataset alone.
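One classical instance selection algorithm of this kind is Wilson's Edited Nearest Neighbour (ENN) rule, which removes a sample when the majority of its k nearest neighbours carry a different class label; note that it needs at least two labels to vote on, which is why it cannot run on the majority class alone. A minimal sketch (function names are our own):

```python
def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def enn_filter(samples, labels, k=3):
    """Wilson's Edited Nearest Neighbour: keep only the samples whose label
    agrees with the majority label of their k nearest neighbours."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Indices of the k nearest neighbours of x (excluding x itself).
        neighbours = sorted((j for j in range(len(samples)) if j != i),
                            key=lambda j: dist2(x, samples[j]))[:k]
        votes = [labels[j] for j in neighbours]
        if votes.count(y) > k // 2:
            keep.append(i)
    return keep

samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels  = [0, 0, 0, 1, 1, 1, 0]   # last sample is a mislabeled outlier
print(enn_filter(samples, labels))  # index 6 is filtered out
```

The mislabeled point at (5.5, 5.5) sits inside the class-1 region, so all three of its nearest neighbours vote against its label and it is removed.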
In this study, we present a novel approach called cluster-based instance selection (CBIS) that combines clustering analysis and instance selection techniques. The two techniques complement each other, enabling effective undersampling of the majority class dataset. CBIS is a two-step approach that first uses a clustering technique to group the data samples in the majority class, where each data sample belongs to a specific cluster. In particular, each cluster can be regarded as a 'subclass' of the majority class. Each data sample is then assigned a new class label accordingly, producing a multiclass dataset from the majority class dataset. Next, the instance selection technique is performed over the generated multiclass dataset to reduce the dataset size for the undersampling purpose. Our experimental results obtained with various domain datasets show that the proposed CBIS approach performs better than many state-of-the-art data-level approaches.
The rest of this paper is organized as follows. Section 2 gives an overview of the class imbalance problem and several representative data-level methods. In addition, the clustering analysis and instance selection techniques are also briefly described. Section 3 introduces the proposed CBIS approach. A description of the experimental setup and results is given in Section 4. Finally, Section 5 concludes the paper.
Class-imbalanced datasets
In class-imbalanced datasets, the imbalance ratio between the minority and the majority classes can be as drastic as 1:100, 1:1000 or even larger [8]. Although there is no exact answer to the question of what magnitude of imbalance ratio will lead to a deterioration of classification performance, in some applications a ratio of 1:35 can render some methods inadequate for building an effective classifier, while in other cases, a ratio of 1:10 is difficult to deal with [28].
In addition to the
The proposed CBIS approach
Fig. 1 shows a block diagram of the proposed cluster-based instance selection (CBIS) approach for undersampling class-imbalanced datasets. It comprises two steps. Consider, for instance, a two-class classification problem with a two-class (training) dataset D that contains majority and minority class datasets denoted by Dmajority and Dminority, respectively. The first step uses the clustering analysis algorithm to group similar data samples of Dmajority into a number of G
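The two-step procedure can be sketched as follows. The snippet above does not fix a particular algorithm pair, so the choice of a plain k-means for the clustering step and of Wilson's ENN as the instance selection step is an illustrative assumption, as are all function names.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_assign(points, k, iters=20, seed=0):
    """A plain k-means; returns the cluster id of every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assign

def enn_filter(samples, labels, k=3):
    """Wilson's ENN: keep samples whose k nearest neighbours agree with them."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        neighbours = sorted((j for j in range(len(samples)) if j != i),
                            key=lambda j: dist2(x, samples[j]))[:k]
        votes = [labels[j] for j in neighbours]
        if votes.count(y) > k // 2:
            keep.append(i)
    return keep

def cbis(majority, minority, n_clusters=2):
    # Step 1: cluster the majority class; cluster ids act as 'subclass' labels,
    # turning the one-class majority set into a multiclass dataset.
    subclass = kmeans_assign(majority, n_clusters)
    # Step 2: instance selection over the generated multiclass dataset.
    keep = enn_filter(majority, subclass)
    reduced = [majority[i] for i in keep]
    # Reassemble the (reduced) training set for the original two-class task.
    samples = reduced + list(minority)
    labels = [0] * len(reduced) + [1] * len(minority)
    return samples, labels
```

The key move is the relabeling in step 1: once each cluster is treated as a subclass, an off-the-shelf instance selection algorithm becomes applicable to what was originally a single-class dataset.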
Experimental setup
In this paper, 44 two-class imbalanced datasets from the KEEL dataset repository are used for the experiments [13]. The imbalance ratios of these datasets range from 1.8 to 129, with the number of features ranging from 4 to 20 and the number of data samples from 130 to 5500. Each dataset is divided into five training and testing subsets based on fivefold cross-validation.
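A stratified fivefold split of this kind can be sketched as follows; in practice the KEEL repository already distributes ready-made fivefold partitions, so this per-class round-robin assignment is only an illustrative stand-in.

```python
import random

def stratified_kfold(labels, n_folds=5, seed=0):
    """Assign every sample a fold id, stratified per class so that each
    fold preserves the original imbalance ratio as closely as possible."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                 # random order within the class
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds   # round-robin over the folds
    return fold_of

labels = [0] * 90 + [1] * 10             # imbalance ratio 9:1
fold_of = stratified_kfold(labels)
# every fold then holds 18 majority and 2 minority samples
```

Stratification matters here: with a plain random split, a 1:100 dataset could easily produce test folds containing no minority samples at all.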
The top five best-performing approaches obtained from Galar et al. [13], who compared 37 related approaches, are used as the
Conclusion
In this paper, we introduce a novel undersampling approach called cluster-based instance selection (CBIS). CBIS is composed of two components: clustering analysis and instance selection. The clustering analysis component is used to cluster similar data samples in the majority class dataset into a number of groups that can be regarded as ‘subclasses’ of the majority class. Afterward, the instance selection component is used to filter out unrepresentative data samples from each of the
Acknowledgments
The work of the second author was supported in part by the Ministry of Science and Technology of Taiwan under Grant MOST 106-2410-H-182-024, in part by the Featured Areas Research Center Program within the Framework of the Higher Education Sprout Project of the Ministry of Education (MOE) of Taiwan under Grants EMRPD1H0421 and EMRPD1H0551 of the Healthy Aging Research Center, Chang Gung University, and in part by Chang Gung Memorial Hospital, Linkou under Grant NERPD2G0301T. The work of the
References (36)
An introduction to ROC analysis, Pattern Recognit. Lett. (2006)
Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2015)
Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. (2016)
Clustering-based undersampling in class-imbalanced data, Inf. Sci. (2017)
Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data, Fuzzy Sets Syst. (2015)
A systematic review of machine learning techniques for software fault prediction, Appl. Soft Comput. (2015)
Intelligent financial fraud detection: a comprehensive review, Comput. Secur. (2016)
Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. (2009)
Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data, Knowl.-Based Syst. (2015)
A survey on face detection in the wild: past, present and future, Comput. Vis. Image Understand. (2015)
KRNN: k rare-class nearest neighbour classification, Pattern Recognit.
To combat multi-class imbalanced problems by means of over-sampling techniques, IEEE Trans. Knowl. Data Eng.
Instance-based learning algorithms, Mach. Learn.
New applications of ensembles of classifiers, Pattern Anal. Appl.
A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl.
Bagging predictors, Mach. Learn.
Using evolutionary algorithms as instance selection for data reduction: an experimental study, IEEE Trans. Evolut. Comput.
SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.