Combined weighted multi-objective optimizer for instance reduction in the two-class imbalanced data problem
Introduction
Instance reduction is one of the most important preprocessing steps in many machine learning tasks (Luan and Dong, 2018, Yang et al., 2019, Shakiba and Hooshmandasl, 2016). Because voluminous data are difficult to handle, redundant, erroneous, or noisy instances should be removed before applying data mining tasks such as instance-based learning methods, e.g., kNN and SVM. Instance reduction alleviates both the high storage requirement and the sensitivity to noise, and it decreases the computational complexity of learning a high-quality classifier (Song and Chen, 2018, Yu et al., 2018). One of the most challenging issues in this field is handling between-class distributions (Wang and Mao, 2019): the between-class distribution should not change significantly after data reduction. An inappropriate reduction method may eliminate more instances from one class than from the other, turning the data set into an imbalanced one.
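As a minimal illustration of what instance reduction does (not the method proposed in this paper), an edited-nearest-neighbour style filter removes instances whose neighbours disagree with their label; the function name, the toy data, and the setting k=3 are all illustrative:

```python
import math

def enn_filter(X, y, k=3):
    """Edited nearest neighbour: drop any instance whose k nearest
    neighbours (excluding itself) vote for a different label."""
    keep = []
    for i, xi in enumerate(X):
        # distances from instance i to every other instance
        dists = sorted(
            (math.dist(xi, xj), y[j])
            for j, xj in enumerate(X) if j != i
        )
        votes = [label for _, label in dists[:k]]
        # keep the instance only if the neighbourhood majority agrees
        if votes.count(y[i]) > k / 2:
            keep.append(i)
    return keep

# two tight clusters plus one mislabelled (noisy) point inside cluster 0
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = [0, 0, 0, 1, 1, 1, 1]  # index 3 carries the wrong label
kept = enn_filter(X, y, k=3)  # the noisy instance is filtered out
```

Note that a filter like this, applied blindly to imbalanced data, is exactly the risk discussed below: rare minority instances look like mislabelled noise to their majority-class neighbours.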
Instance reduction from class-balanced data has been investigated extensively, but studies on class-imbalanced data are scarce. Recently, learning from imbalanced data has attracted considerable attention due to its practical applications in many domains, such as computer vision, network intrusion detection, medical diagnosis, and fraud detection (Kaur et al., 2019). A two-class imbalanced data set contains instances from two classes, where instances of one class (the majority class) outnumber those of the other (the minority class).
Imbalanced data can cause difficulties in learning tasks. Most studies apply traditional methods to imbalanced data, yet acceptable results may not be achieved: traditional methods often cover the majority instances well while neglecting the minority class. Even when the obtained accuracy is high, the result is not reliable, because the cardinality of the minority class is very small compared to that of the majority class. Maintaining the between-class distribution is therefore an important issue in the instance reduction problem. Although minority class instances are typically essential in imbalanced data classification, their scarcity means that reduction methods may mistake them for outliers or noise. These instances should not be removed when an instance reduction method is applied to imbalanced data sets, which makes specially designed methods a necessity.
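Why high accuracy is misleading on imbalanced data can be shown with a small numeric sketch; the class counts (95 majority, 5 minority) and the degenerate classifier are hypothetical:

```python
import math

# hypothetical imbalanced test set: 95 majority (0), 5 minority (1)
y_true = [0] * 95 + [1] * 5
# a degenerate classifier that always predicts the majority class
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)            # 0.95, looks strong
sensitivity = tp / (tp + fn)                   # 0.0 on the minority class
specificity = tn / (tn + fp)                   # 1.0 on the majority class
gmean = math.sqrt(sensitivity * specificity)   # 0.0 exposes the failure
```

Accuracy rewards the trivial majority predictor with 0.95, while the geometric mean of the per-class rates collapses to zero, which is why Gmean is used alongside accuracy throughout this paper.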
Many approaches have been proposed to deal with instance reduction for imbalanced data (Díez-Pastor et al., 2015b, Díez-Pastor et al., 2015a, Dong et al., 2018, Fernández et al., 2018b, Fernández et al., 2018a). They include data-level methods, which fall into two main groups: undersampling, which decreases the size of the majority class (Prasad et al., 2019), and oversampling, which increases the size of the minority class (Krawczyk et al., 2019); cost-sensitive methods (Ling et al., 2006); and ensemble-based methods (Galar et al., 2011). Evolutionary methods integrated with undersampling techniques have gained attention, and some studies suggest that they outperform non-evolutionary models in both instance reduction (de Haro-García et al., 2018, García-Pedrajas et al., 2014, Wang et al., 2015) and imbalanced data analysis (García and Herrera, 2009). The Krill Herd Algorithm (KHA), one of the most potent evolutionary algorithms, has been applied widely in recent years with acceptable results (Jensi and Jiji, 2016, Adhvaryyu et al., 2017, Niu et al., 2017). It can effectively explore and exploit solution spaces of different landscapes and dimensionalities and converge to acceptable regions within the solution space (Mozaffari et al., 2017). However, its global exploration ability is not as strong as its local exploitation ability, and it does not always converge rapidly. A modification of the KHA based on chaotic maps has been presented to tackle this problem.
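The details of the chaotic KHA variant are not spelled out in this excerpt, but a common pattern in chaos-enhanced metaheuristics is to replace uniform random draws with a deterministic chaotic sequence such as the logistic map; the sketch below works under that assumption, and the parameters x0 and r are illustrative:

```python
def logistic_map(x0=0.7, r=4.0, n=100):
    """Generate n chaotic values in (0, 1) via x_{t+1} = r * x_t * (1 - x_t).
    At r = 4 the orbit is chaotic and densely covers the unit interval,
    which helps a metaheuristic escape local optima."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq

chaos = logistic_map()
# the sequence stays bounded in [0, 1] but never settles into a cycle,
# so it can stand in for the algorithm's random perturbation terms
```

The appeal over a plain pseudo-random generator is that the chaotic orbit is ergodic yet correlated step to step, which several of the cited chaos-embedded KHA studies exploit to strengthen global exploration.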
In this paper, an evolutionary undersampling method for balanced and imbalanced data distributions is proposed. The proposed method generates a reduced set composed of those instances that enhance the performance of the classifiers. Also, it reduces the computational time needed to learn a classifier. The proposed method controls between-class distribution and protects minority class instances.
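The minority-protection idea can be sketched as follows: candidate removals are drawn only from the majority class, so every minority instance survives the reduction. The random undersampler here is a simple stand-in for the evolutionary search, and the function name and keep ratio are hypothetical:

```python
import random

def undersample_majority(y, keep_ratio=0.5, seed=0):
    """Return kept indices: all minority instances plus a random
    fraction of the majority class. A stand-in for the evolutionary
    search, illustrating the minority-protection constraint only."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    minority = min(counts, key=counts.get)
    majority_idx = [i for i, lab in enumerate(y) if lab != minority]
    minority_idx = [i for i, lab in enumerate(y) if lab == minority]
    rng = random.Random(seed)
    kept_majority = rng.sample(majority_idx,
                               int(len(majority_idx) * keep_ratio))
    return sorted(minority_idx + kept_majority)

y = [0] * 90 + [1] * 10          # class 0 is the majority
kept = undersample_majority(y)    # all 10 minority indices survive
```

Restricting removals in this way keeps the between-class ratio from degrading further, which is the behaviour the proposed method enforces during its search.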
In the proposed method, instance reduction is viewed as an unconstrained multi-objective optimization problem. Using the chaotic krill herd evolutionary algorithm, both the minority and majority class spaces are explored with accelerated convergence. The krill individuals (instances) are evaluated by a new combined weighted fitness function over partly conflicting criteria: accuracy, geometric mean (Gmean), and reduction rate. Accuracy, and consequently Gmean, can go against the reduction rate: in some cases accuracy and Gmean improve while the reduction rate drops, or vice versa. Using the designed decision surface, the krill individuals with the best fitness are found. The output of the proposed method is a set purged of those instances that decrease accuracy and Gmean.
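A weighted aggregation of the three objectives can be sketched as below; the weights are illustrative, not the paper's calibrated values, and the paper's actual decision surface (WDDS) is not reproduced here:

```python
def combined_fitness(accuracy, gmean, reduction_rate,
                     w_acc=0.4, w_gmean=0.4, w_red=0.2):
    """Weighted sum of the three (partly conflicting) objectives.
    Weights are hypothetical placeholders for illustration."""
    return w_acc * accuracy + w_gmean * gmean + w_red * reduction_rate

# a candidate that reduces aggressively may trade away some Gmean ...
aggressive = combined_fitness(accuracy=0.90, gmean=0.70, reduction_rate=0.80)
# ... while a conservative candidate keeps more instances
conservative = combined_fitness(accuracy=0.93, gmean=0.88, reduction_rate=0.30)
```

With these example weights the aggressive candidate scores higher, showing how the weighting steers the trade-off between classification quality and storage reduction.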
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 elaborates on the proposed method, while the experimental results using benchmark data sets are presented in Section 4. A discussion on the conducted experiments is presented in Section 5. Finally, the conclusions and future work are drawn in Section 6.
Related work
Instance reduction is a preprocessing step developed to enhance learning tasks, especially for instance-based methods that need to decide on storing instances that are preferable for generalization. Instance reduction has been rarely addressed in the context of two-class imbalanced data. Various methods have been developed to remove noisy and redundant instances from underlying balanced and imbalanced data sets. Instance Reduction Algorithm using Hyperrectangle Clustering (IRAHC) (Hamidzadeh et
Proposed method
The proposed method is designed to select a significant subset of instances from both two-class balanced and two-class imbalanced data. Taking into account the high storage requirement, computational cost, and noise sensitivity of instance-based learning methods, a new combined weighted multi-objective optimizer is introduced with the aim of obtaining the training set for a learning algorithm, e.g., kNN or SVM.
In the proposed method, instance reduction is viewed as an unconstrained
Experimental results
In this section, the experimental results of the proposed method are compared with some state-of-the-art methods. Thorough experiments have been conducted on three scenarios: Scenario 1, described in Section 4.1, covers balanced data sets. Scenario 2, presented in Section 4.2, covers imbalanced data sets. Finally, Scenario 3, elaborated on in Section 4.3, covers synthetic imbalanced data sets. Since the focus of the proposed method is on two-class
Discussion
Comparing the results in Table 5, Table 10 with those in Tables 7, 9, 11, and 13 confirms the contribution of instance reduction methods to increasing the accuracy and reducing the computational time of the instance-based classifier. As the experimental results on both balanced and imbalanced data sets show, ISCKHAD achieves the highest accuracy/Gmean on most of the data sets. It also provides high reduction rates thanks to its capability for removing redundant and noisy data.
Conclusions and future work
In the present paper, an improved evolutionary algorithm is designed to remove noisy and redundant data. The proposed method, as a combined weighted multi-objective optimizer, is established such that it controls between-class distribution and protects minority class instances. In this paper, three decision surfaces, WDDS, CA, and CM are introduced and compared with other methods. The experimental results show that ISCKHAD (the proposed instance reduction method that uses WDDS as its decision
CRediT authorship contribution statement
Javad Hamidzadeh: Conceptualization, Methodology, Validation, Investigation. Niloufar Kashefi: Software, Writing - original draft, Resources. Mona Moradi: Software, Data curation, Writing - review & editing.
References (65)
- et al., A multi-objective evolutionary approach to training set selection for support vector machine, Knowl.-Based Syst. (2018)
- et al., Dynamic optimal power flow of combined heat and power system with valve-point effect using Krill Herd algorithm, Energy (2017)
- et al., EXPLICA: An explorative imperialist competitive algorithm based on the notion of explorers with an expansive retention policy, Appl. Soft Comput. (2017)
- et al., Instance selection for regression: Adapting DROP, Neurocomputing (2016)
- et al., Prototype selection to improve monotonic nearest neighbor, Eng. Appl. Artif. Intell. (2017)
- et al., Bare-bones imperialist competitive algorithm for a compensatory neural fuzzy controller, Neurocomputing (2016)
- et al., Diversity techniques improve the performance of the best imbalance learning ensembles, Inform. Sci. (2015)
- et al., Random balance: ensembles of variable priors classifiers for imbalanced data, Knowl.-Based Syst. (2015)
- et al., Kernel sparse modeling for prototype selection, Knowl.-Based Syst. (2016)
- et al., LMIRA: large margin instance reduction algorithm, Neurocomputing (2014)
- IRAHC: instance reduction algorithm using hyperrectangle clustering, Pattern Recognit.
- Instance selection based on boosting for instance-based learners, Pattern Recognit.
- Comparison of different chaotic maps in particle swarm optimization algorithm for long-term cascaded hydroelectric system scheduling, Chaos Solitons Fractals
- An improved krill herd algorithm with global exploration capability for solving numerical function optimization problems and its application to data clustering, Appl. Soft Comput.
- Adaptive multi-objective swarm fusion for imbalanced data classification, Inf. Fusion
- Clustering-based undersampling in class-imbalanced data, Inform. Sci.
- An efficient instance selection algorithm to reconstruct training set for support vector machine, Knowl.-Based Syst.
- Experimental identification of hard data sets for classification and feature selection methods with insights on method selection, Data Knowl. Eng.
- Chaos embedded krill herd algorithm for optimal VAR dispatch problem of power system, Int. J. Electr. Power Energy Syst.
- Model turbine heat rate by fast learning network with tuning based on ameliorated krill herd algorithm, Knowl.-Based Syst.
- A modified imperialist competitive algorithm for multi-robot stick-carrying application, Robot. Auton. Syst.
- Data volume reduction in covering approximation spaces with respect to twenty-two types of covering based rough sets, Internat. J. Approx. Reason.
- Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit.
- Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Inform. Sci.
- EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data, Neurocomputing
- Pseudo-label neighborhood rough set: measures and attribute reductions, Internat. J. Approx. Reason.
- UCI Repository of Machine Learning Databases
- An efficient approach for instance selection
- A novel density-based approach for instance selection
- LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol.
- SMOTEBoost: Improving prediction of the minority class in boosting
No author associated with this paper has disclosed any potential or pertinent conflicts that may be perceived to conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103500.