A hybrid imbalanced classification model based on data density
Introduction
Imbalanced data is a dataset in which the number of instances differs markedly across classes: the minority class contains few instances while the majority class contains many. Training a model that can effectively identify minority class instances in such a dataset is known as the imbalanced classification problem [1]. A classifier trained on imbalanced data tends to be biased toward the majority class and to neglect minority class instances, yielding low recall and lowering the AUC and G-mean. In the real world, however, minority class instances often carry the more crucial information, for example in medical diagnosis [2], [3], [4], [5], fraud detection [6], [7], [8], and credit evaluation [9], [10], [11]. In fraud detection, only a few fraudulent instances exist among many normal ones, yet accurately identifying the fraudulent instances is critical to preventing financial losses. Consequently, improving the recall of minority class instances is especially important in imbalanced classification.
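The metric argument above can be made concrete with a small sketch on illustrative data (not from the paper): a degenerate classifier that always predicts the majority class scores high accuracy yet zero recall and zero G-mean.

```python
import numpy as np

# Toy imbalanced labels: 95 majority-class (0) and 5 minority-class (1) instances.
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)                                   # 0.95: looks strong
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()   # 0.0: misses every minority instance
specificity = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()
g_mean = np.sqrt(recall * specificity)                                 # 0.0: exposes the bias

print(accuracy, recall, g_mean)
```

Accuracy alone rewards the bias toward the majority class, which is why recall, AUC, and G-mean are the metrics of interest here.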
To address the pressing need for imbalanced classification, a series of solutions has been proposed [12], [13], [14], [15]. These methods fall into three categories. The first resorts to data-level methods that resample the original dataset, mainly over-sampling the minority class, under-sampling the majority class, or hybrid-sampling both classes [16], [17], [18]. The second is algorithm-level methods [19], [20], [21], which leave the data distribution unchanged and instead enhance classification performance by improving the classifiers; examples include cost-sensitive learning [22] and ensemble learning [23]. The last category is hybrid methods, which combine data-level with algorithm-level methods [24], [25], [26]. In hybrid methods, the data-level component alters the data distribution to reduce the degree of imbalance, while the algorithm-level component changes the learning process to improve classification performance.
Hybrid methods therefore offer clear advantages, and to improve classification performance on imbalanced data this paper proposes a hybrid imbalanced classification model based on data density. Among the three main types of data-level methods, under-sampling may lose crucial information when removing majority class instances, and over-sampling can lead to overfitting; hybrid-sampling can overcome both deficiencies. Since the class distribution of imbalanced data is not uniform, the data density and the location of instances may affect the effectiveness of a sampling algorithm. This paper therefore proposes a density-based resampling method: based on data density, the data space is divided into different regions, and instances from these regions are sampled with the Bootstrap method to generate corresponding subsets for the different classes of instances. To stabilize classification performance, several weak classifiers are combined to determine the output, an ensemble learning approach capable of improving classification accuracy and reducing classification error [27]. We construct corresponding ensemble models for the different classes of instances and present a model selection algorithm that chooses a suitable model for each instance based on its distribution.
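The paper's exact partition into five regions is not reproduced in this excerpt, but the mechanics of density-based resampling can be sketched in a simplified, illustrative form: a hypothetical `knn_density` proxy splits each class into a dense (interior) and a sparse (borderline) region, and both regions are Bootstrap-sampled to build balanced training subsets.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_density(X, k=5):
    """Simple density proxy: inverse mean distance to the k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)                      # column 0 is each point's distance to itself
    return 1.0 / (d[:, 1:k + 1].mean(axis=1) + 1e-12)

def density_bootstrap(X, y, n_subsets=3, k=5):
    """For each class, split its instances into dense and sparse regions by the
    median density, then Bootstrap-sample both regions to build balanced
    training subsets (returned as row indices into X, y)."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()          # balance every subset to the minority size
    subsets = []
    for _ in range(n_subsets):
        rows = []
        for c in classes:
            idx = np.flatnonzero(y == c)
            dens = knn_density(X[idx], k=min(k, len(idx) - 1))
            dense = idx[dens >= np.median(dens)]
            sparse = idx[dens < np.median(dens)]
            if sparse.size == 0:        # degenerate case: all densities equal
                sparse = dense
            half = n_per_class // 2
            rows.append(rng.choice(dense, half, replace=True))
            rows.append(rng.choice(sparse, n_per_class - half, replace=True))
        subsets.append(np.concatenate(rows))
    return subsets
```

Each returned subset is class-balanced yet drawn from both interior and borderline regions, so the weak classifiers trained on them see the geometry of each class rather than only its bulk.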
To obtain a better binary imbalanced data classification model, this paper focuses on a hybrid imbalanced classification model based on data density. The main contributions are as follows:
- (1) To improve the classification performance of imbalanced data, we propose HICD for the binary imbalanced data classification task.
- (2) To find the characteristics of different classes of instances, we present the data partition algorithm, which divides the data space into five regions based on data density. Furthermore, by sampling from the divided regions, we propose the density-based resampling method.
- (3) For different classes of instances, the corresponding ensemble models are constructed. In addition, we present the model selection algorithm. On this basis, an appropriate model is selected for each instance based on its distribution.
The rest of this paper is arranged as follows. Section 2 introduces recent related works. Section 3 describes the HICD in detail. Section 4 presents the experimental setting and compares HICD with other state-of-the-art methods on 18 datasets. Section 5 concludes the paper.
Related works
To solve imbalanced classification, there are three types of methods: data-level, algorithm-level, and hybrid methods. Data-level methods include under-sampling, over-sampling, and hybrid-sampling. Random under-sampling (RUS) is a commonly used under-sampling approach that changes the data distribution by randomly removing majority class instances [28]. However, in the process of removing majority class instances, valuable information may be lost [29]. In order to address this
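Random under-sampling as described can be sketched in a few lines (an illustrative version, not the paper's code): majority class rows are randomly discarded until both classes match the minority size.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Randomly keep only as many instances of each class as the minority class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                           for c in classes])
    return X[keep], y[keep]
```

The information-loss criticism above is visible here directly: every majority row not in `keep` is simply discarded, regardless of how informative it is.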
HICD: Hybrid imbalanced classification model based on data density
This paper proposes a hybrid imbalanced classification model based on data density (HICD), the framework of which is shown in Fig. 1. The HICD consists of two parts: density-based resampling and model selection. The details of density-based resampling are given in Section 3.1 and the details of model selection are given in Section 3.2. Finally, the details of the HICD are given in Section 3.3.
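The full framework details follow in the sections referenced above; as one common combiner for the ensemble part, a majority-vote rule can be sketched as follows (illustrative, not necessarily the paper's exact combination rule).

```python
import numpy as np

def majority_vote(predictions):
    """Combine 0/1 predictions from several weak classifiers by majority vote.
    `predictions` has shape (n_classifiers, n_samples)."""
    P = np.asarray(predictions)
    return (P.mean(axis=0) >= 0.5).astype(int)

votes = [[1, 0, 0],   # classifier 1
         [1, 1, 0],   # classifier 2
         [0, 1, 0]]   # classifier 3
print(majority_vote(votes))  # [1 1 0]
```

Training one weak classifier per resampled subset and voting in this way is the standard route by which an ensemble stabilizes the output of its individual members.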
Datasets
A total of 18 datasets are collected from the KEEL Repository (http://sci2s.ugr.es/keel/imbalanced.php) to evaluate the proposed HICD. The datasets used in the experiments are shown in Table 1.
Table 1 gives detailed information on these datasets, including the dataset name, the number of attributes (5–34), the number of instances (172–1484), the data distribution (number of majority class instances, number of minority class instances), and the imbalance ratio.
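The imbalance ratio reported for such datasets is conventionally the majority-to-minority count ratio; a minimal sketch on made-up labels:

```python
import numpy as np

def imbalance_ratio(y):
    """IR = (# majority-class instances) / (# minority-class instances)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

y = np.array([0] * 90 + [1] * 10)
print(imbalance_ratio(y))  # 9.0
```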
Conclusion
In this paper, the HICD for binary imbalanced data classification is put forward. At the data level, the density-based resampling method is proposed. To find the characteristics of the different classes of instances, a data partition algorithm is given that divides the data space into five regions based on data density. Instances from the divided regions are sampled to generate corresponding subsets for the different classes of instances. At the algorithm level, we
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the Natural Science Foundation of Hebei Province of China (Project No. G2019202350).
References (51)
- Core dataset extraction from unlabeled medical big data for lesion localization, Big Data Research (2021).
- TWD-SFNN: Three-way decisions with a single hidden layer feedforward neural network, Information Sciences (2021).
- Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending, Information Sciences (2020).
- Internet financing credit risk evaluation using multiple structural interacting elastic net feature selection, Pattern Recognition (2021).
- SMOTE-NaN-DE: Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution, Knowledge-Based Systems (2021).
- Balancing exploration and exploitation: A novel active learner for imbalanced data, Knowledge-Based Systems (2020).
- A systematic study of the class imbalance problem in convolutional neural networks, Neural Networks (2018).
- Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification, Information Sciences (2019).
- A weighted hybrid ensemble method for classifying imbalanced data, Knowledge-Based Systems (2020).
- A novel hybrid ensemble model based on tree-based method and deep learning method for default prediction, Expert Systems with Applications (2021).
- Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition.
- Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications.
- Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance, Pattern Recognition Letters.
- Clustering-based undersampling in class-imbalanced data, Information Sciences.
- Self-organizing map oversampling (SOMO) for imbalanced data set learning, Expert Systems with Applications.
- Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences.
- A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences.
- Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences.
- Learning from imbalanced data: Open challenges and future directions, Progress in Artificial Intelligence.
- Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as nanobiocompatible platform, Cancer Nanotechnology.
- A greedy deep learning method for medical disease analysis, IEEE Access.
- Effective detection of sophisticated online banking fraud on extremely imbalanced data, World Wide Web.
- Using harmony search algorithm in neural networks to improve fraud detection in banking system, Computational Intelligence and Neuroscience.
- A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications, IEEE Systems Journal.