Elsevier

Information Sciences

Volume 624, May 2023, Pages 50-67

A hybrid imbalanced classification model based on data density

https://doi.org/10.1016/j.ins.2022.12.046

Abstract

Imbalanced data are ubiquitous in the real world, and it is difficult to effectively identify the minority-class instances in such data. Various imbalanced classification models have been proposed; however, these models neglect the data density and the location of instances, which can be important factors affecting classification performance. To tackle this issue, this paper proposes a hybrid imbalanced classification model based on data density (HICD). At the data level, a density-based resampling method is presented: a data partition algorithm divides the data space into five regions based on data density, and corresponding subsets are generated by sampling from the divided regions to improve the recognition of different classes of instances. At the algorithm level, we construct corresponding ensemble models for different classes of instances and present a model selection algorithm; on this basis, an appropriate model is selected for each instance based on its distribution. The performance of the proposed HICD was evaluated on 18 real-world imbalanced datasets in terms of recall, the area under the ROC curve (AUC), and G-mean. The experimental results validate that our method outperforms other competitive algorithms in imbalanced classification.

Introduction

Imbalanced data is a type of dataset in which the number of instances differs markedly across classes: the minority class contains few instances while the majority class contains many. Training a model that can effectively identify the minority-class instances in such a dataset is challenging; this is the imbalanced classification problem [1]. In imbalanced classification, a classifier tends to be biased toward the majority class and to neglect the minority-class instances, which yields low recall and lowers the AUC and G-mean. In the real world, however, the minority-class instances often carry the more crucial information, as in medical diagnosis [2], [3], [4], [5], fraud detection [6], [7], [8], and credit evaluation [9], [10], [11]. Fraud detection, for example, involves only a few fraudulent instances among many normal ones, yet accurately identifying the fraudulent instances is critical to preventing financial losses. It is therefore especially important to enhance recall on the minority class in imbalanced classification.
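The metrics named above can be made concrete with a small sketch. Recall is the fraction of positive-class instances that are correctly predicted, and G-mean is the geometric mean of the per-class recalls (sensitivity and specificity); the helper names and toy labels below are illustrative, not taken from the paper.

```python
# Illustrative computation of recall and G-mean for binary imbalanced labels.

def recall(y_true, y_pred, positive):
    """Fraction of `positive`-class instances that were correctly predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / actual if actual else 0.0

def g_mean(y_true, y_pred, positive=1, negative=0):
    """Geometric mean of the recalls of the two classes."""
    sens = recall(y_true, y_pred, positive)
    spec = recall(y_true, y_pred, negative)
    return (sens * spec) ** 0.5

# Toy imbalanced labels: 8 majority (0) vs. 2 minority (1) instances.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one false positive, one missed minority
print(recall(y_true, y_pred, positive=1))  # 0.5
```

Note how a classifier that predicted all zeros would score 80% accuracy here but zero recall and zero G-mean, which is why these metrics are preferred for imbalanced data.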

To address this pressing need, a series of solutions has been proposed [12], [13], [14], [15]. These methods fall into three classes. The first resorts to data-level methods that resample the original dataset, mainly over-sampling of the minority class, under-sampling of the majority class, and hybrid-sampling of both [16], [17], [18]. The second class comprises algorithm-level methods [19], [20], [21], which leave the data distribution unchanged and instead enhance classification performance by improving the classifiers; examples include cost-sensitive learning [22] and ensemble learning [23]. The last category is hybrid methods, which combine data-level and algorithm-level methods [24], [25], [26]: the data-level component alters the data distribution to reduce the degree of imbalance, while the algorithm-level component changes the learning process to improve classification performance.

Hybrid methods therefore offer clear advantages, and to improve classification performance on imbalanced data this paper proposes a hybrid imbalanced classification model based on data density. Among the three main types of data-level methods, under-sampling may lose crucial information when removing majority-class instances, and over-sampling can lead to overfitting; hybrid-sampling can overcome these deficiencies. Since the class distribution of imbalanced data is not uniform, the data density and the location of instances may affect the effectiveness of a sampling algorithm. This paper therefore proposes a density-based resampling method: based on data density, the data space is divided into different regions, and instances from the regions are sampled with the Bootstrap method to generate corresponding subsets for different classes of instances. To stabilize classification performance, several weak classifiers can be combined to determine the output; this ensemble learning approach can improve classification accuracy and reduce classification error [27]. We construct corresponding ensemble models for different classes of instances and present a model selection algorithm, with which a suitable model is chosen for each instance based on its distribution.
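The snippet above does not reproduce HICD's actual five-region rule, but the general idea of density-based resampling can be sketched: estimate each instance's local density from its k nearest neighbours, bucket instances into density regions, and draw a Bootstrap sample from each region. The quantile-based bucketing, the value of k, and all function names here are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of density-based resampling: k-NN density estimate,
# quantile partition into regions, and per-region bootstrap sampling.
import math
import random

def knn_density(X, k=3):
    """Crude density score: inverse of the mean distance to the k nearest neighbours."""
    scores = []
    for i, xi in enumerate(X):
        dists = sorted(math.dist(xi, xj) for j, xj in enumerate(X) if j != i)
        scores.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return scores

def partition_by_density(X, n_regions=5, k=3):
    """Split instance indices into `n_regions` buckets by ascending density."""
    scores = knn_density(X, k)
    order = sorted(range(len(X)), key=lambda i: scores[i])
    size = math.ceil(len(X) / n_regions)
    return [order[r * size:(r + 1) * size] for r in range(n_regions)]

def bootstrap(indices, rng):
    """Sample with replacement (the Bootstrap) from one region."""
    return [rng.choice(indices) for _ in indices]

rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
regions = partition_by_density(X, n_regions=5)
subsets = [bootstrap(r, rng) for r in regions if r]
```

Each bootstrap subset could then train one weak classifier, giving the per-region ensembles the paper builds on; HICD's actual region definitions depend on both density and instance location.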

To obtain a better binary imbalanced data classification model, this paper focuses on a hybrid imbalanced classification model based on data density. The main contributions are as follows:

  • (1)

    To improve the classification performance of imbalanced data, we propose HICD for the binary imbalanced data classification task.

  • (2)

    To find the characteristics of different classes of instances, we present the data partition algorithm, which divides the data space into five regions based on data density. Furthermore, by sampling from the divided regions, we propose the density-based resampling method.

  • (3)

    For different classes of instances, the corresponding ensemble models are constructed. In addition, we present the model selection algorithm. On this basis, an appropriate model is selected for each instance based on its distribution.
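The ensemble idea referenced in contribution (3) can be sketched as simple majority voting over several weak classifiers; the stub classifiers below are stand-ins, not the models HICD actually trains.

```python
# Minimal majority-vote combiner for an ensemble of weak classifiers.
def majority_vote(classifiers, x):
    """Return the label predicted by the most classifiers for instance x."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Three hypothetical weak classifiers disagreeing on one instance:
clfs = [lambda x: 1, lambda x: 0, lambda x: 1]
print(majority_vote(clfs, x=None))  # 1
```

HICD goes one step further than plain voting: its model selection algorithm picks which ensemble to consult for each instance based on where that instance falls in the data distribution.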

The rest of this paper is arranged as follows. Section 2 introduces recent related works. Section 3 describes the HICD in detail. Section 4 presents the experimental setting and compares HICD with other state-of-the-art methods on 18 datasets. Section 5 concludes the paper.

Section snippets

Related works

To solve imbalanced classification, there are three types of methods: data-level, algorithm-level, and hybrid methods. Data-level methods include under-sampling, over-sampling, and hybrid-sampling. Random under-sampling (RUS) is a commonly used under-sampling approach that changes the data distribution by randomly removing majority-class instances [28]. However, valuable information may be lost in the process of removing majority-class instances [29]. In order to address this …
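RUS, as described above, simply drops random majority-class instances until the classes balance. A minimal sketch (function and variable names are ours, not from the cited work):

```python
# Random under-sampling: discard random majority instances until the
# class counts are equal.
import random

def random_under_sample(X, y, majority_label, rng=None):
    """Return (X, y) with the majority class randomly reduced to minority size."""
    rng = rng or random.Random(0)
    maj = [i for i, lbl in enumerate(y) if lbl == majority_label]
    mino = [i for i, lbl in enumerate(y) if lbl != majority_label]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]

X = list(range(12))
y = [0] * 9 + [1] * 3          # 9 majority vs. 3 minority instances
Xr, yr = random_under_sample(X, y, majority_label=0)
print(yr.count(0), yr.count(1))  # 3 3
```

The six discarded majority instances illustrate the information-loss drawback noted in [29]: any of them might have lain near the decision boundary.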

HICD: Hybrid imbalanced classification model based on data density

This paper proposes a hybrid imbalanced classification model based on data density (HICD), the framework of which is shown in Fig. 1. The HICD consists of two parts: density-based resampling and model selection. The details of density-based resampling are given in Section 3.1 and the details of model selection are given in Section 3.2. Finally, the details of the HICD are given in Section 3.3.

Datasets

A total of 18 datasets are collected from the KEEL Repository (http://sci2s.ugr.es/keel/imbalanced.php) to evaluate the proposed HICD. The datasets used in the experiments are shown in Table 1.

Table 1 describes the detailed information for these datasets, including the name of the dataset, number of attributes (5–34), number of instances (172–1484), data distribution (numbers of majority-class and minority-class instances), and the imbalance ratio …
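The imbalance measure summarised in Table 1 is conventionally the imbalance ratio (IR): the majority-class count divided by the minority-class count. A one-line sketch, with a made-up example dataset:

```python
# Imbalance ratio (IR): how many majority instances per minority instance.
def imbalance_ratio(n_majority, n_minority):
    return n_majority / n_minority

# e.g., a hypothetical dataset with 1400 majority and 84 minority instances:
print(round(imbalance_ratio(1400, 84), 2))  # 16.67
```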

Conclusion

In this paper, the HICD for binary imbalanced data classification is put forward. At the data level, the density-based resampling method is proposed. To find the characteristics of different classes of instances, the data partition algorithm is given, with which the data space is divided into five regions based on data density. The instances from the divided regions are sampled to generate corresponding subsets for different classes of instances. At the algorithm level, we …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by Natural Science Foundation of Hebei Province of China (Project No. G2019202350).

References (51)

  • M. Tahir et al., Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition (2012)
  • S.-J. Yen et al., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications (2009)
  • D. Devi et al., Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance, Pattern Recognition Letters (2017)
  • W.-C. Lin et al., Clustering-based undersampling in class-imbalanced data, Information Sciences (2017)
  • G. Douzas et al., Self-organizing map oversampling (SOMO) for imbalanced data set learning, Expert Systems with Applications (2017)
  • G. Douzas et al., Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences (2018)
  • Y. Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences (1997)
  • J. Sun et al., Imbalanced enterprise credit evaluation with DTE-SBD: decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences (2018)
  • B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence (2016)
  • H. Nasrollahpour et al., Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as nanobiocompatible platform, Cancer Nanotechnology (2021)
  • C. Wu et al., A greedy deep learning method for medical disease analysis, IEEE Access (2018)
  • W. Wei et al., Effective detection of sophisticated online banking fraud on extremely imbalanced data, World Wide Web (2013)
  • S. Daliri, Using harmony search algorithm in neural networks to improve fraud detection in banking system, Computational Intelligence and Neuroscience (2020)
  • J. Yang et al., A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications, IEEE Systems Journal (2011)
  • F. Xia, R. Hao, J. Li, N. Xiong, L.T. Yang, Y. Zhang, Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless...