Elsevier

Information Sciences

Volume 477, March 2019, Pages 47-54

Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

https://doi.org/10.1016/j.ins.2018.10.029

Highlights

  • A novel approach for under-sampling class imbalanced datasets is proposed.

  • It is based on combining clustering analysis and instance selection.

  • Instance selection is used for the clustering result of the majority class dataset.

  • The proposed approach outperforms five baseline approaches over 44 datasets.

Abstract

Class-imbalanced datasets, i.e., those in which the number of data samples in one class is much larger than that in another, occur in many real-world problems. With such datasets, it is very difficult to construct effective classifiers using current classification algorithms, especially for distinguishing small (or minority) classes from the majority class. To address the class imbalance problem, undersampling and oversampling techniques have been widely used to reduce and enlarge the numbers of data samples in the majority and minority classes, respectively. Moreover, combinations of certain sampling approaches with ensemble classifiers have shown reasonably good performance. In this paper, a novel undersampling approach called cluster-based instance selection (CBIS), which combines clustering analysis and instance selection, is introduced. The clustering analysis component groups similar data samples of the majority class dataset into ‘subclasses’, while the instance selection component filters out unrepresentative data samples from each of the ‘subclasses’. The experimental results based on the KEEL dataset repository show that the CBIS approach can make bagging- and boosting-based MLP ensemble classifiers perform significantly better than six state-of-the-art approaches, regardless of which clustering (affinity propagation and k-means) and instance selection (IB3, DROP3 and GA) algorithms are used.

Introduction

In the current era of big data, data mining and analysis are becoming increasingly important for making effective decisions. Among the various data mining techniques, classification analysis is one of the most widely used for various business and engineering problems, such as bankruptcy prediction [21], cancer prediction [18], churn prediction [23], face detection [35], fraud detection [31], and software fault prediction [24].

In general, the developed classifiers (or prediction models) usually perform well over evenly distributed data of different classes. However, in practice, the data collected for training the classifiers are usually class imbalanced, i.e., the numbers of data samples in different classes differ greatly. For example, in a two-class dataset, one class may contain 10 data samples while the other contains 1000. In particular, the distribution of data in the feature space is usually skewed in class-imbalanced datasets [8]. Furthermore, datasets with a skewed distribution usually exhibit other problematic characteristics, such as data sample overlap, small sample sizes, and small disjuncts [4], [13].

The characteristics of class-imbalanced datasets mentioned above differ from the assumption of a relatively balanced distribution of data for most classification algorithms. This difference means that it is very difficult for classifiers to correctly predict the small (or minority) class, and they are likely to misclassify the testing samples into the prevalent (or majority) class [8], [28]. However, in many real-world problems, e.g., credit card fraud detection (non-fraud vs. fraud cases), bankruptcy prediction (non-bankrupt vs. bankrupt cases), and various disease detection predictions (non-infected vs. infected cases), the accuracy of detection, prediction or classification of the data in the minority class is critical.

In the literature, three types of approaches have been used to tackle the class imbalance problem. They are the data-level [1], [19], algorithm-level [34], [36], and cost-sensitive methods [8], [22]. Among these, the data-level methods are most widely used for class-imbalanced datasets [13].

The data-level methods aim at reducing the imbalance ratio between the majority and minority classes, either by undersampling the data in the majority class [19], [33] or by oversampling the data in the minority class [1], [7]. As dataset sizes continue to grow, undersampling is often the better choice, since it reduces rather than enlarges the training set.
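As a minimal illustration of the undersampling idea (not a method from this paper), the majority class can be randomly reduced until a target imbalance ratio is reached; the dataset and class sizes below are hypothetical:

```python
import random

random.seed(42)
# Hypothetical two-class dataset: 1000 majority and 10 minority samples.
majority = [("maj", i) for i in range(1000)]
minority = [("min", i) for i in range(10)]

# Random undersampling: keep only as many majority samples as needed
# to reach a target imbalance ratio (here 1:1 with the minority class).
kept_majority = random.sample(majority, len(minority))
balanced = kept_majority + minority

print(len(balanced))  # 20 samples, ratio 1:1
```

Random undersampling is the simplest data-level baseline; the cluster-based methods discussed next try to choose the retained majority samples more carefully.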

To reduce the number of data samples in the majority class, cluster-based sampling methods were introduced; such methods can outperform the random sampling approach [20], [29], [33]. In general, cluster-based sampling methods partition a given majority class dataset into a number of clusters, after which representative data samples are selected from each cluster. However, cluster-based sampling methods have several limitations that directly affect the reduced majority class dataset and the final classification performance. For instance, it is difficult to decide on the number of clusters needed for the optimal clustering result. In addition, the representative data samples to be selected from each cluster need to be carefully defined; otherwise, the original data distribution in the majority class may be changed.
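The general cluster-based sampling scheme can be sketched as follows. This is an illustrative stand-in, not the paper's method: a tiny k-means with a fixed k groups hypothetical 2-D majority samples, and the sample nearest each centroid is kept as that cluster's representative.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means: returns final centroids and cluster memberships."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D majority-class samples around three centres.
rng = random.Random(1)
majority = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(50)]

centroids, clusters = kmeans(majority, k=3)
# One representative per cluster: the point nearest its centroid.
representatives = [min(cl, key=lambda p: math.dist(p, c))
                   for c, cl in zip(centroids, clusters) if cl]
print(len(representatives))  # at most k representatives
```

The limitations noted above are visible even in this sketch: k must be chosen in advance, and keeping only the centroid-nearest samples can distort the original data distribution of the majority class.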

In the related studies of data preprocessing, instance selection is used to filter out unrepresentative data samples (or outliers) from a given training dataset, which can make the classifiers outperform those trained on the original training dataset without performing instance selection [14]. In general, this method can be used to reduce the dataset size of the majority class. However, since the existing instance selection algorithms are designed to distinguish between good and noisy data samples from the multiclass datasets, they cannot handle datasets that only contain one class, i.e., the majority class dataset.

In this study, we present a novel approach called cluster-based instance selection (CBIS) that is derived by combining clustering analysis and instance selection techniques. The characteristics of the clustering analysis and instance selection techniques mean that they complement each other for effective undersampling of the majority class dataset. CBIS is a two-step approach that first uses a clustering technique to group a number of data samples in the majority class, where each data sample belongs to a specific cluster. In particular, each cluster can be regarded as a ‘subclass’ of the majority class. Each data sample is then associated with a new class label, and as a result, a multiclass dataset for the majority class dataset is produced. Next, the instance selection technique is performed over the generated multiclass dataset to reduce the dataset size for the undersampling purpose. Our experimental results obtained with various domain datasets show that the proposed CBIS approach performs better than many state-of-the-art data-level approaches.
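The two-step CBIS idea can be sketched under simplifying assumptions. In this illustrative version (all data hypothetical), 1-D majority-class values are grouped by a tiny k-means, each cluster becomes a ‘subclass’ label (step 1), and an edited-nearest-neighbour rule then drops samples whose nearest neighbour carries a different subclass label (step 2) as a stand-in for the IB3, DROP3 or GA algorithms used in the paper:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Cluster 1-D values with k-means; return a subclass label per value."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels

rng = random.Random(7)
# Hypothetical majority-class feature values forming two groups,
# plus a stray point between them.
majority = ([rng.gauss(0.0, 0.5) for _ in range(30)]
            + [rng.gauss(10.0, 0.5) for _ in range(30)]
            + [5.0])

# Step 1: clustering turns the one-class dataset into a multiclass one.
subclass = kmeans_1d(majority, k=2)

# Step 2: instance selection over the generated multiclass dataset;
# drop samples whose nearest neighbour has a different subclass label.
def enn_keep(values, labels):
    kept = []
    for i, v in enumerate(values):
        j = min((x for x in range(len(values)) if x != i),
                key=lambda x: abs(values[x] - v))
        if labels[j] == labels[i]:
            kept.append(v)
    return kept

reduced_majority = enn_keep(majority, subclass)
print(len(reduced_majority) <= len(majority))  # True: a subset is kept
```

The key design point is step 1: by relabelling clusters as subclasses, the one-class majority dataset becomes a multiclass dataset, which is exactly what existing instance selection algorithms require.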

The rest of this paper is organized as follows. Section 2 gives an overview of the class imbalance problem and several representative data-level methods. In addition, the clustering analysis and instance selection techniques are also briefly described. Section 3 introduces the proposed CBIS approach. A description of the experimental setup and results is given in Section 4. Finally, Section 5 concludes the paper.

Class-imbalanced datasets

In class-imbalanced datasets, the imbalance ratio between the minority and the majority classes can be as drastic as 1:100, 1:1000 or even larger [8]. Although there is no exact answer to the question of what magnitude of imbalance ratio will lead to a deterioration of classification performance, in some applications a ratio of 1:35 can render some methods inadequate for building an effective classifier, while in other cases, a ratio of 1:10 is difficult to deal with [28].

In addition to the

The proposed CBIS approach

Fig. 1 shows a block diagram of the proposed cluster-based instance selection (CBIS) approach for undersampling class-imbalanced datasets. It comprises two steps. Consider a two-class classification problem with a two-class (training) dataset D that contains majority and minority class datasets denoted by Dmajority and Dminority, respectively. The first step is based on using the clustering analysis algorithm to group similar data samples of Dmajority into a number of G

Experimental setup

In this paper, 44 two-class imbalanced datasets from the KEEL dataset repository are used for the experiments [13]. The imbalance ratios of these datasets are between 1.8 and 129 with the number of features and data samples ranging from 4 to 20, and 130 to 5500, respectively. Each dataset is divided into five training and testing subsets based on fivefold cross-validation.
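The fivefold split described above can be sketched as follows; this is an illustrative stratified split on a hypothetical dataset, assuming each fold should preserve the class ratio (the paper does not specify stratification, so this detail is an assumption):

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Assign sample indices to k folds, class by class, so each fold
    preserves the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Hypothetical dataset with imbalance ratio 9:1.
labels = ["maj"] * 90 + ["min"] * 10
folds = stratified_kfold(labels)
for test_fold in folds:
    train = [i for i in range(len(labels)) if i not in set(test_fold)]
    # Each fold holds 18 majority and 2 minority indices.
    assert sum(labels[i] == "min" for i in test_fold) == 2
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

Stratification matters for imbalanced data: with ratios up to 1:129, an unstratified fold could easily contain no minority samples at all.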

The top five best-performing approaches obtained from Galar et al. [13], who compared 37 related approaches, are used as the

Conclusion

In this paper, we introduce a novel undersampling approach called cluster-based instance selection (CBIS). CBIS is composed of two components: clustering analysis and instance selection. The clustering analysis component is used to cluster similar data samples in the majority class dataset into a number of groups that can be regarded as ‘subclasses’ of the majority class. Afterward, the instance selection component is used to filter out unrepresentative data samples from each of the

Acknowledgments

The work of the second author was supported in part by the Ministry of Science and Technology of Taiwan under Grant MOST 106-2410-H-182-024, in part by the Featured Areas Research Center Program within the Framework of the Higher Education Sprout Project of the Ministry of Education (MOE) of Taiwan under Grants EMRPD1H0421 and EMRPD1H0551 of the Healthy Aging Research Center, Chang Gung University, and in part by Chang Gung Memorial Hospital, Linkou under Grant NERPD2G0301T. The work of the

References (36)

  • X. Zhang et al., KRNN: k rare-class nearest neighbour classification, Pattern Recognit. (2017)
  • L. Abdi et al., To combat multi-class imbalanced problems by means of over-sampling techniques, IEEE Trans. Knowl. Data Eng. (2016)
  • D.W. Aha et al., Instance-based learning algorithms, Mach. Learn. (1991)
  • R. Barandela et al., New applications of ensembles of classifiers, Pattern Anal. Appl. (2003)
  • G.E. Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. (2004)
  • L. Breiman, Bagging predictors, Mach. Learn. (1996)
  • J.R. Cano et al., Using evolutionary algorithms as instance selection for data reduction: an experimental study, IEEE Trans. Evolut. Comput. (2003)
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)