Elsevier

Information Sciences

Volume 624, May 2023, Pages 50-67

A hybrid imbalanced classification model based on data density

https://doi.org/10.1016/j.ins.2022.12.046

Abstract

Imbalanced data are ubiquitous in the real world, and it is difficult to effectively identify the minority-class instances in such data. Various imbalanced classification models have been proposed; however, these models neglect the data density and the location of instances, which can be important factors affecting classification performance. To tackle this issue, this paper proposes a hybrid imbalanced classification model based on data density (HICD). At the data level, a density-based resampling method is presented: a data partition algorithm divides the data space into five regions based on data density, and corresponding subsets are generated by sampling from the divided regions to improve the recognition of different classes of instances. At the algorithm level, we construct corresponding ensemble models for different classes of instances and present a model selection algorithm; on this basis, an appropriate model is selected for each instance based on its distribution. The performance of the proposed HICD was evaluated on 18 real-world imbalanced datasets in terms of recall, the area under the ROC curve (AUC), and G-mean. The experimental results validate that our method outperforms other competitive algorithms in imbalanced classification.

Introduction

Imbalanced data is a type of dataset in which the number of instances differs markedly across classes: the minority class contains few instances while the majority class contains many. Training a model that can effectively identify the minority-class instances in such a dataset is challenging; this is the imbalanced classification problem [1]. In imbalanced classification, a classifier tends to be biased toward the majority class and to neglect the minority-class instances, which yields low recall and lowers the AUC and G-mean. In the real world, however, the minority-class instances often carry the more crucial information, as in medical diagnosis [2], [3], [4], [5], fraud detection [6], [7], [8], and credit evaluation [9], [10], [11]. Fraud detection, for example, involves only a few fraudulent instances among many normal ones, yet accurately identifying the fraudulent instances is critical to preventing financial losses. It is therefore especially important to enhance recall on the minority class in imbalanced classification.
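The metrics named above can be made concrete with a small sketch. Recall is the fraction of positive-class instances that are correctly predicted, and G-mean is the geometric mean of the per-class recalls (sensitivity and specificity); the helper names and toy labels below are illustrative, not taken from the paper.

```python
# Illustrative computation of recall and G-mean for binary imbalanced labels.

def recall(y_true, y_pred, positive):
    """Fraction of `positive`-class instances that were correctly predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / actual if actual else 0.0

def g_mean(y_true, y_pred, positive=1, negative=0):
    """Geometric mean of the recalls of the two classes."""
    sens = recall(y_true, y_pred, positive)
    spec = recall(y_true, y_pred, negative)
    return (sens * spec) ** 0.5

# Toy imbalanced labels: 8 majority (0) vs. 2 minority (1) instances.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one false positive, one missed minority
print(recall(y_true, y_pred, positive=1))  # 0.5
```

Note how a classifier that predicted all zeros would score 80% accuracy here but zero recall and zero G-mean, which is why these metrics are preferred for imbalanced data.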

To address this pressing need, a series of solutions has been proposed [12], [13], [14], [15]. These methods fall into three classes. The first resorts to data-level methods that resample the original dataset, mainly over-sampling of the minority class, under-sampling of the majority class, and hybrid-sampling of both [16], [17], [18]. The second class comprises algorithm-level methods [19], [20], [21], which leave the data distribution unchanged and instead enhance classification performance by improving the classifiers; examples include cost-sensitive learning [22] and ensemble learning [23]. The last category is hybrid methods, which combine data-level and algorithm-level methods [24], [25], [26]: the data-level component alters the data distribution to reduce the degree of imbalance, while the algorithm-level component changes the learning process to improve classification performance.

Hybrid methods therefore offer clear advantages, and to improve classification performance on imbalanced data this paper proposes a hybrid imbalanced classification model based on data density. Among the three main types of data-level methods, under-sampling may lose crucial information when removing majority-class instances, and over-sampling can lead to overfitting; hybrid-sampling can overcome these deficiencies. Since the class distribution of imbalanced data is not uniform, the data density and the location of instances may affect the effectiveness of a sampling algorithm. This paper therefore proposes a density-based resampling method: based on data density, the data space is divided into different regions, and instances from the regions are sampled with the Bootstrap method to generate corresponding subsets for different classes of instances. To stabilize classification performance, several weak classifiers can be combined to determine the output; this ensemble learning approach can improve classification accuracy and reduce classification error [27]. We construct corresponding ensemble models for different classes of instances and present a model selection algorithm, with which a suitable model is chosen for each instance based on its distribution.
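The snippet above does not reproduce HICD's actual five-region rule, but the general idea of density-based resampling can be sketched: estimate each instance's local density from its k nearest neighbours, bucket instances into density regions, and draw a Bootstrap sample from each region. The quantile-based bucketing, the value of k, and all function names here are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of density-based resampling: k-NN density estimate,
# quantile partition into regions, and per-region bootstrap sampling.
import math
import random

def knn_density(X, k=3):
    """Crude density score: inverse of the mean distance to the k nearest neighbours."""
    scores = []
    for i, xi in enumerate(X):
        dists = sorted(math.dist(xi, xj) for j, xj in enumerate(X) if j != i)
        scores.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return scores

def partition_by_density(X, n_regions=5, k=3):
    """Split instance indices into `n_regions` buckets by ascending density."""
    scores = knn_density(X, k)
    order = sorted(range(len(X)), key=lambda i: scores[i])
    size = math.ceil(len(X) / n_regions)
    return [order[r * size:(r + 1) * size] for r in range(n_regions)]

def bootstrap(indices, rng):
    """Sample with replacement (the Bootstrap) from one region."""
    return [rng.choice(indices) for _ in indices]

rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
regions = partition_by_density(X, n_regions=5)
subsets = [bootstrap(r, rng) for r in regions if r]
```

Each bootstrap subset could then train one weak classifier, giving the per-region ensembles the paper builds on; HICD's actual region definitions depend on both density and instance location.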

To obtain a better binary imbalanced data classification model, this paper focuses on a hybrid imbalanced classification model based on data density. The main contributions are as follows:

  • (1)

    To improve the classification performance of imbalanced data, we propose HICD for the binary imbalanced data classification task.

  • (2)

    To find the characteristics of different classes of instances, we present the data partition algorithm, which divides the data space into five regions based on data density. Furthermore, by sampling from the divided regions, we propose the density-based resampling method.

  • (3)

    For different classes of instances, the corresponding ensemble models are constructed. In addition, we present the model selection algorithm. On this basis, an appropriate model is selected for each instance based on its distribution.
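The ensemble idea referenced in contribution (3) can be sketched as simple majority voting over several weak classifiers; the stub classifiers below are stand-ins, not the models HICD actually trains.

```python
# Minimal majority-vote combiner for an ensemble of weak classifiers.
def majority_vote(classifiers, x):
    """Return the label predicted by the most classifiers for instance x."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Three hypothetical weak classifiers disagreeing on one instance:
clfs = [lambda x: 1, lambda x: 0, lambda x: 1]
print(majority_vote(clfs, x=None))  # 1
```

HICD goes one step further than plain voting: its model selection algorithm picks which ensemble to consult for each instance based on where that instance falls in the data distribution.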

The rest of this paper is arranged as follows. Section 2 introduces recent related works. Section 3 describes the HICD in detail. Section 4 presents the experimental setting and compares HICD with other state-of-the-art methods on 18 datasets. Section 5 concludes the paper.

Section snippets

Related works

To solve imbalanced classification, there are three types of methods: data-level, algorithm-level, and hybrid methods. Data-level methods include under-sampling, over-sampling, and hybrid-sampling. Random under-sampling (RUS) is a commonly used under-sampling approach that changes the data distribution by randomly removing majority-class instances [28]. However, valuable information may be lost in the process of removing majority-class instances [29]. In order to address this …
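RUS, as described above, simply drops random majority-class instances until the classes balance. A minimal sketch (function and variable names are ours, not from the cited work):

```python
# Random under-sampling: discard random majority instances until the
# class counts are equal.
import random

def random_under_sample(X, y, majority_label, rng=None):
    """Return (X, y) with the majority class randomly reduced to minority size."""
    rng = rng or random.Random(0)
    maj = [i for i, lbl in enumerate(y) if lbl == majority_label]
    mino = [i for i, lbl in enumerate(y) if lbl != majority_label]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]

X = list(range(12))
y = [0] * 9 + [1] * 3          # 9 majority vs. 3 minority instances
Xr, yr = random_under_sample(X, y, majority_label=0)
print(yr.count(0), yr.count(1))  # 3 3
```

The six discarded majority instances illustrate the information-loss drawback noted in [29]: any of them might have lain near the decision boundary.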

HICD: Hybrid imbalanced classification model based on data density

This paper proposes a hybrid imbalanced classification model based on data density (HICD), the framework of which is shown in Fig. 1. The HICD consists of two parts: density-based resampling and model selection. The details of density-based resampling are given in Section 3.1 and the details of model selection are given in Section 3.2. Finally, the details of the HICD are given in Section 3.3.

Datasets

A total of 18 datasets are collected from the KEEL Repository (http://sci2s.ugr.es/keel/imbalanced.php) to evaluate the proposed HICD. The datasets used in the experiments are shown in Table 1.

Table 1 describes the detailed information for these datasets, including the name of the dataset, number of attributes (5–34), number of instances (172–1484), data distribution (numbers of majority-class and minority-class instances), and the imbalance ratio …
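The imbalance measure summarised in Table 1 is conventionally the imbalance ratio (IR): the majority-class count divided by the minority-class count. A one-line sketch, with a made-up example dataset:

```python
# Imbalance ratio (IR): how many majority instances per minority instance.
def imbalance_ratio(n_majority, n_minority):
    return n_majority / n_minority

# e.g., a hypothetical dataset with 1400 majority and 84 minority instances:
print(round(imbalance_ratio(1400, 84), 2))  # 16.67
```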

Conclusion

In this paper, the HICD for binary imbalanced data classification is put forward. At the data level, the density-based resampling method is proposed. To find the characteristics of different classes of instances, the data partition algorithm is given, with which the data space is divided into five regions based on data density. The instances from the divided regions are sampled to generate corresponding subsets for different classes of instances. At the algorithm level, we …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by Natural Science Foundation of Hebei Province of China (Project No. G2019202350).

References (51)

  • M. Tahir et al., Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition (2012)
  • S.-J. Yen et al., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications (2009)
  • D. Devi et al., Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance, Pattern Recognition Letters (2017)
  • W.-C. Lin et al., Clustering-based undersampling in class-imbalanced data, Information Sciences (2017)
  • G. Douzas et al., Self-organizing map oversampling (SOMO) for imbalanced data set learning, Expert Systems with Applications (2017)
  • G. Douzas et al., Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences (2018)
  • Y. Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences (1997)
  • J. Sun et al., Imbalanced enterprise credit evaluation with DTE-SBD: decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences (2018)
  • B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence (2016)
  • H. Nasrollahpour et al., Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as nanobiocompatible platform, Cancer Nanotechnology (2021)
  • C. Wu et al., A greedy deep learning method for medical disease analysis, IEEE Access (2018)
  • W. Wei et al., Effective detection of sophisticated online banking fraud on extremely imbalanced data, World Wide Web (2013)
  • S. Daliri, Using harmony search algorithm in neural networks to improve fraud detection in banking system, Computational Intelligence and Neuroscience (2020)
  • J. Yang et al., A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications, IEEE Systems Journal (2011)
  • F. Xia, R. Hao, J. Li, N. Xiong, L.T. Yang, Y. Zhang, Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless...