Combined weighted multi-objective optimizer for instance reduction in the two-class imbalanced data problem
Introduction
Instance reduction is one of the most important preprocessing steps in many machine learning tasks (Luan and Dong, 2018, Yang et al., 2019, Shakiba and Hooshmandasl, 2016). Because voluminous data are difficult to handle, redundant, erroneous, or noisy instances should be removed before applying data mining tasks such as instance-based learning methods, e.g., kNN and SVM. Instance reduction alleviates both the high storage requirement and the sensitivity to noise, and it decreases the computational complexity of learning a high-quality classifier (Song and Chen, 2018, Yu et al., 2018). One of the most challenging issues in this field is handling between-class distributions (Wang and Mao, 2019): the between-class distribution should not change significantly after data reduction. An inappropriate reduction method may eliminate more instances from one class than from the other, turning the data set into an imbalanced one.
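As a minimal illustration of what instance reduction does (not the method proposed in this paper), an edited-nearest-neighbour style filter removes instances whose neighbours disagree with their label; the function name, the toy data, and the setting k=3 are all illustrative:

```python
import math

def enn_filter(X, y, k=3):
    """Edited nearest neighbour: drop any instance whose k nearest
    neighbours (excluding itself) vote for a different label."""
    keep = []
    for i, xi in enumerate(X):
        # distances from instance i to every other instance
        dists = sorted(
            (math.dist(xi, xj), y[j])
            for j, xj in enumerate(X) if j != i
        )
        votes = [label for _, label in dists[:k]]
        # keep the instance only if the neighbourhood majority agrees
        if votes.count(y[i]) > k / 2:
            keep.append(i)
    return keep

# two tight clusters plus one mislabelled (noisy) point inside cluster 0
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = [0, 0, 0, 1, 1, 1, 1]  # index 3 carries the wrong label
kept = enn_filter(X, y, k=3)  # the noisy instance is filtered out
```

Note that a filter like this, applied blindly to imbalanced data, is exactly the risk discussed below: rare minority instances look like mislabelled noise to their majority-class neighbours.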
Instance reduction from class-balanced data has been investigated extensively, but studies on class-imbalanced data are scarce. Recently, learning from imbalanced data has attracted considerable attention due to its practical applications in many domains, such as computer vision, network intrusion detection, medical diagnosis, and fraud detection (Kaur et al., 2019). A two-class imbalanced data set contains instances from two classes, where instances of one class (the majority class) outnumber those of the other (the minority class).
Imbalanced data can cause difficulties in learning tasks. Most studies apply traditional methods to imbalanced data, yet acceptable results may not be achieved: traditional methods often cover the majority instances well while neglecting the minority class. Even when the obtained accuracy is high, the result is not reliable, because the cardinality of the minority class is very small compared to that of the majority class. Maintaining the between-class distribution is therefore an important issue in the instance reduction problem. Although minority class instances are typically essential in imbalanced data classification, their scarcity means that reduction methods may mistake them for outliers or noise. These instances should not be removed when an instance reduction method is applied to imbalanced data sets, which makes specially designed methods a necessity.
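Why high accuracy is misleading on imbalanced data can be shown with a small numeric sketch; the class counts (95 majority, 5 minority) and the degenerate classifier are hypothetical:

```python
import math

# hypothetical imbalanced test set: 95 majority (0), 5 minority (1)
y_true = [0] * 95 + [1] * 5
# a degenerate classifier that always predicts the majority class
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)            # 0.95, looks strong
sensitivity = tp / (tp + fn)                   # 0.0 on the minority class
specificity = tn / (tn + fp)                   # 1.0 on the majority class
gmean = math.sqrt(sensitivity * specificity)   # 0.0 exposes the failure
```

Accuracy rewards the trivial majority predictor with 0.95, while the geometric mean of the per-class rates collapses to zero, which is why Gmean is used alongside accuracy throughout this paper.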
Many approaches have been proposed to deal with instance reduction for imbalanced data (Díez-Pastor et al., 2015b, Díez-Pastor et al., 2015a, Dong et al., 2018, Fernández et al., 2018b, Fernández et al., 2018a). They include data-level methods, which fall into two main groups: undersampling, which decreases the size of the majority class (Prasad et al., 2019), and oversampling, which increases the size of the minority class (Krawczyk et al., 2019); cost-sensitive methods (Ling et al., 2006); and ensemble-based methods (Galar et al., 2011). Evolutionary methods integrated with undersampling techniques have gained attention, and some studies suggest that they outperform non-evolutionary models in both instance reduction (de Haro-García et al., 2018, García-Pedrajas et al., 2014, Wang et al., 2015) and imbalanced data analysis (García and Herrera, 2009). The Krill Herd Algorithm (KHA), one of the most potent evolutionary algorithms, has been applied widely in recent years with acceptable results (Jensi and Jiji, 2016, Adhvaryyu et al., 2017, Niu et al., 2017). It can effectively explore and exploit solution spaces of different landscapes and dimensionalities and converge to acceptable regions within the solution space (Mozaffari et al., 2017). However, its global exploration ability is not as strong as its local exploitation ability, and it does not always converge rapidly. A modification of the KHA based on chaotic maps has been presented to tackle this problem.
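The details of the chaotic KHA variant are not spelled out in this excerpt, but a common pattern in chaos-enhanced metaheuristics is to replace uniform random draws with a deterministic chaotic sequence such as the logistic map; the sketch below works under that assumption, and the parameters x0 and r are illustrative:

```python
def logistic_map(x0=0.7, r=4.0, n=100):
    """Generate n chaotic values in (0, 1) via x_{t+1} = r * x_t * (1 - x_t).
    At r = 4 the orbit is chaotic and densely covers the unit interval,
    which helps a metaheuristic escape local optima."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq

chaos = logistic_map()
# the sequence stays bounded in [0, 1] but never settles into a cycle,
# so it can stand in for the algorithm's random perturbation terms
```

The appeal over a plain pseudo-random generator is that the chaotic orbit is ergodic yet correlated step to step, which several of the cited chaos-embedded KHA studies exploit to strengthen global exploration.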
In this paper, an evolutionary undersampling method for balanced and imbalanced data distributions is proposed. The proposed method generates a reduced set composed of those instances that enhance the performance of the classifiers. Also, it reduces the computational time needed to learn a classifier. The proposed method controls between-class distribution and protects minority class instances.
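The minority-protection idea can be sketched as follows: candidate removals are drawn only from the majority class, so every minority instance survives the reduction. The random undersampler here is a simple stand-in for the evolutionary search, and the function name and keep ratio are hypothetical:

```python
import random

def undersample_majority(y, keep_ratio=0.5, seed=0):
    """Return kept indices: all minority instances plus a random
    fraction of the majority class. A stand-in for the evolutionary
    search, illustrating the minority-protection constraint only."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    minority = min(counts, key=counts.get)
    majority_idx = [i for i, lab in enumerate(y) if lab != minority]
    minority_idx = [i for i, lab in enumerate(y) if lab == minority]
    rng = random.Random(seed)
    kept_majority = rng.sample(majority_idx,
                               int(len(majority_idx) * keep_ratio))
    return sorted(minority_idx + kept_majority)

y = [0] * 90 + [1] * 10          # class 0 is the majority
kept = undersample_majority(y)    # all 10 minority indices survive
```

Restricting removals in this way keeps the between-class ratio from degrading further, which is the behaviour the proposed method enforces during its search.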
In the proposed method, instance reduction is viewed as an unconstrained multi-objective optimization problem. Using the chaotic krill herd evolutionary algorithm, both the minority and majority class spaces are explored with accelerated convergence. The krill individuals (instances) are evaluated by a new combined weighted fitness function over partly conflicting criteria: accuracy, geometric mean (Gmean), and reduction rate. Accuracy, and consequently Gmean, can go against the reduction rate: in some cases accuracy and Gmean improve while the reduction rate drops, or vice versa. Using the designed decision surface, the krill individuals with the best fitness are found. The output of the proposed method is a set purged of those instances that decrease accuracy and Gmean.
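A weighted aggregation of the three objectives can be sketched as below; the weights are illustrative, not the paper's calibrated values, and the paper's actual decision surface (WDDS) is not reproduced here:

```python
def combined_fitness(accuracy, gmean, reduction_rate,
                     w_acc=0.4, w_gmean=0.4, w_red=0.2):
    """Weighted sum of the three (partly conflicting) objectives.
    Weights are hypothetical placeholders for illustration."""
    return w_acc * accuracy + w_gmean * gmean + w_red * reduction_rate

# a candidate that reduces aggressively may trade away some Gmean ...
aggressive = combined_fitness(accuracy=0.90, gmean=0.70, reduction_rate=0.80)
# ... while a conservative candidate keeps more instances
conservative = combined_fitness(accuracy=0.93, gmean=0.88, reduction_rate=0.30)
```

With these example weights the aggressive candidate scores higher, showing how the weighting steers the trade-off between classification quality and storage reduction.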
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 elaborates on the proposed method, while the experimental results using benchmark data sets are presented in Section 4. A discussion on the conducted experiments is presented in Section 5. Finally, the conclusions and future work are drawn in Section 6.
Related work
Instance reduction is a preprocessing step developed to enhance learning tasks, especially for instance-based methods that need to decide on storing instances that are preferable for generalization. Instance reduction has been rarely addressed in the context of two-class imbalanced data. Various methods have been developed to remove noisy and redundant instances from underlying balanced and imbalanced data sets. Instance Reduction Algorithm using Hyperrectangle Clustering (IRAHC) (Hamidzadeh et
Proposed method
The proposed method is designed to select a significant subset of instances from both two-class balanced and two-class imbalanced data. Taking into account the high storage requirement, computational cost, and noise sensitivity of instance-based learning methods, a new combined weighted multi-objective optimizer is introduced with the aim of obtaining the training set for a learning algorithm, e.g., kNN or SVM.
In the proposed method, instance reduction is viewed as an unconstrained
Experimental results
In this section, the experimental results of the proposed method are compared with some state-of-the-art methods. Thorough experiments have been conducted on three scenarios: Scenario 1, described in Section 4.1, covers balanced data sets. Scenario 2, presented in Section 4.2, covers imbalanced data sets. Finally, Scenario 3, elaborated on in Section 4.3, covers synthetic imbalanced data sets. Since the focus of the proposed method is on two-class
Discussion
Comparing the results in Table 5, Table 10 with those in Tables 7, 9, 11, and 13 confirms the contribution of instance reduction methods to increasing the accuracy and reducing the computational time of the instance-based classifier. As the experimental results on both balanced and imbalanced data sets show, ISCKHAD achieves the highest accuracy/Gmean on most of the data sets. It also provides high reduction rates thanks to its capability for removing redundant and noisy data.
Conclusions and future work
In the present paper, an improved evolutionary algorithm is designed to remove noisy and redundant data. The proposed method, as a combined weighted multi-objective optimizer, is established such that it controls between-class distribution and protects minority class instances. In this paper, three decision surfaces, WDDS, CA, and CM are introduced and compared with other methods. The experimental results show that ISCKHAD (the proposed instance reduction method that uses WDDS as its decision
CRediT authorship contribution statement
Javad Hamidzadeh: Conceptualization, Methodology, Validation, Investigation. Niloufar Kashefi: Software, Writing - original draft, Resources. Mona Moradi: Software, Data curation, Writing - review & editing.
References (65)
- et al., A multi-objective evolutionary approach to training set selection for support vector machine, Knowl.-Based Syst. (2018)
- et al., Dynamic optimal power flow of combined heat and power system with valve-point effect using Krill Herd algorithm, Energy (2017)
- et al., EXPLICA: An explorative imperialist competitive algorithm based on the notion of explorers with an expansive retention policy, Appl. Soft Comput. (2017)
- et al., Instance selection for regression: Adapting DROP, Neurocomputing (2016)
- et al., Prototype selection to improve monotonic nearest neighbor, Eng. Appl. Artif. Intell. (2017)
- et al., Bare-bones imperialist competitive algorithm for a compensatory neural fuzzy controller, Neurocomputing (2016)
- et al., Diversity techniques improve the performance of the best imbalance learning ensembles, Inform. Sci. (2015)
- et al., Random balance: ensembles of variable priors classifiers for imbalanced data, Knowl.-Based Syst. (2015)
- et al., Kernel sparse modeling for prototype selection, Knowl.-Based Syst. (2016)
- et al., LMIRA: large margin instance reduction algorithm, Neurocomputing (2014)
- IRAHC: instance reduction algorithm using hyperrectangle clustering, Pattern Recognit.
- Instance selection based on boosting for instance-based learners, Pattern Recognit.
- Comparison of different chaotic maps in particle swarm optimization algorithm for long-term cascaded hydroelectric system scheduling, Chaos Solitons Fractals
- An improved krill herd algorithm with global exploration capability for solving numerical function optimization problems and its application to data clustering, Appl. Soft Comput.
- Adaptive multi-objective swarm fusion for imbalanced data classification, Inf. Fusion
- Clustering-based undersampling in class-imbalanced data, Inform. Sci.
- An efficient instance selection algorithm to reconstruct training set for support vector machine, Knowl.-Based Syst.
- Experimental identification of hard data sets for classification and feature selection methods with insights on method selection, Data Knowl. Eng.
- Chaos embedded krill herd algorithm for optimal VAR dispatch problem of power system, Int. J. Electr. Power Energy Syst.
- Model turbine heat rate by fast learning network with tuning based on ameliorated krill herd algorithm, Knowl.-Based Syst.
- A modified imperialist competitive algorithm for multi-robot stick-carrying application, Robot. Auton. Syst.
- Data volume reduction in covering approximation spaces with respect to twenty-two types of covering based rough sets, Internat. J. Approx. Reason.
- Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit.
- Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Inform. Sci.
- EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data, Neurocomputing
- Pseudo-label neighborhood rough set: measures and attribute reductions, Internat. J. Approx. Reason.
- UCI Repository of Machine Learning Databases
- An efficient approach for instance selection
- A novel density-based approach for instance selection
- LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol.
- SMOTEBoost: Improving prediction of the minority class in boosting
No author associated with this paper has disclosed any potential or pertinent conflicts that may be perceived to conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103500.