Combined weighted multi-objective optimizer for instance reduction in two-class imbalanced data problem

https://doi.org/10.1016/j.engappai.2020.103500

Abstract

Instance reduction from class-balanced data has been investigated extensively; however, there is a lack of studies on class-imbalanced data. Learning from imbalanced data has lately attracted considerable attention owing to its practical applications. In two-class imbalanced data, the instances from one class (the majority class) outnumber the instances from the other class (the minority class). The present paper introduces a new instance reduction method that preserves between-class distributions in balanced data and efficiently handles minority class instance reduction in two-class imbalanced data. The proposed method treats instance reduction as an unconstrained multi-objective optimization problem. Accordingly, a new combined weighted optimizer is designed. By employing the chaotic krill herd evolutionary algorithm, both the minority and majority class spaces are explored with accelerated convergence. Through this method, the original data set is purged of those instances that decrease accuracy and Gmean. The performance has been evaluated on both imbalanced and balanced data sets collected from the UCI repository using 10-fold cross-validation. Evaluations show that the proposed method outperforms state-of-the-art methods in terms of classification accuracy, Gmean, reduction rate, and computational time.

Introduction

Instance reduction is one of the most important preprocessing steps in many machine learning tasks (Luan and Dong, 2018, Yang et al., 2019, Shakiba and Hooshmandasl, 2016). Because of the difficulties in handling voluminous data, redundant, erroneous, or noisy instances need to be removed before applying data mining tasks such as instance-based learning methods, e.g., kNN and SVM. Instance reduction alleviates the high storage requirement and sensitivity to noise, and it decreases the computational complexity of learning a high-quality classifier (Song and Chen, 2018, Yu et al., 2018). One of the most challenging issues in this field is handling between-class distributions (Wang and Mao, 2019): the between-class distribution should not change significantly after data reduction. An inappropriate reduction method may eliminate more instances from one class than from the other, turning a balanced data set into an imbalanced one.

Instance reduction from class-balanced data has been investigated extensively; however, there is a lack of studies on class-imbalanced data. Learning from imbalanced data has recently attracted a lot of attention due to its practical applications in many domains, such as computer vision, network intrusion detection, medical diagnosis, and fraud detection (Kaur et al., 2019). Two-class imbalanced data contain instances from two classes, where instances from one class (the majority class) outnumber instances from the other class (the minority class).

The problem of imbalanced data may cause some difficulties in learning tasks. Most studies employ traditional methods for learning from imbalanced data, but acceptable results may not be achieved, because traditional methods often give good coverage of the majority instances while neglecting the minority class. Even if the obtained accuracy is high, the result is not reliable, because the cardinality of the minority class is very small compared to that of the majority class. Therefore, maintaining the between-class distribution is an important issue in the instance reduction problem. Although minority class instances are typically essential in imbalanced data classification, reduction methods may mistake them for outliers or noise. Accordingly, these instances should not be removed when an instance reduction method is applied to imbalanced data sets. Hence, specialized methods are a necessity.
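To see why accuracy alone is misleading here, consider a minimal sketch (not from the paper; the 95/5 split and the degenerate classifier are invented for illustration): a model that always predicts the majority class scores 95% accuracy, while Gmean, the geometric mean of the per-class recalls, collapses to zero.

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of the per-class recalls (sensitivity and specificity)."""
    tpr = np.mean(y_pred[y_true == 1] == 1)  # recall on the minority class
    tnr = np.mean(y_pred[y_true == 0] == 0)  # recall on the majority class
    return np.sqrt(tpr * tnr)

# 95 majority (label 0) vs. 5 minority (label 1) instances
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # degenerate classifier: always predicts majority

print(f"accuracy = {np.mean(y_true == y_pred):.2f}")  # 0.95 -- looks fine
print(f"Gmean    = {gmean(y_true, y_pred):.2f}")      # 0.00 -- exposes the failure
```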

Many different approaches have been proposed (Díez-Pastor et al., 2015b, Díez-Pastor et al., 2015a, Dong et al., 2018, Fernández et al., 2018b, Fernández et al., 2018a) to deal with instance reduction for imbalanced data. Among them, one can mention data-level methods, which fall into two main groups: undersampling, in which the size of the majority class is decreased (Prasad et al., 2019), and oversampling, in which the size of the minority class is increased (Krawczyk et al., 2019); cost-sensitive methods (Ling et al., 2006); and ensemble-based methods (Galar et al., 2011). Evolutionary methods integrated with undersampling techniques have gained attention. Some studies suggest that evolutionary methods outperform non-evolutionary models in both instance reduction (de Haro-García et al., 2018, García-Pedrajas et al., 2014, Wang et al., 2015) and imbalanced data analysis (García and Herrera, 2009). The Krill Herd Algorithm (KHA), one of the most potent evolutionary algorithms, has been applied widely in recent years with acceptable results (Jensi and Jiji, 2016, Adhvaryyu et al., 2017, Niu et al., 2017). This evolutionary algorithm can effectively explore/exploit solution spaces of different landscapes and dimensionality and converge to acceptable regions within the solution space (Mozaffari et al., 2017). However, its global exploration ability is not as strong as its local exploitation ability, and it does not always converge rapidly. A modification of the KHA based on chaotic maps has been presented to tackle this problem.
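A minimal sketch of the chaotic ingredient, assuming, as in common chaos-enhanced metaheuristics (cf. He et al., 2009; Mukherjee et al., 2016), that a logistic map replaces the pseudo-random coefficients of the position update. The function names are illustrative, and the paper's exact chaotic KHA operators may differ.

```python
import numpy as np

def chaotic_sequence(n, x=0.7):
    """n values from the logistic map x <- 4x(1-x), a common chaotic generator.

    Unlike i.i.d. uniform draws, the sequence is deterministic but non-repeating,
    which is the property chaos-enhanced metaheuristics exploit to escape
    premature convergence.
    """
    out = np.empty(n)
    for i in range(n):
        x = 4.0 * x * (1.0 - x)
        out[i] = x
    return out

def move_krill(position, best, chaotic_value, step=0.5):
    """Illustrative position update: drift toward the best-known solution,
    scaled by a chaotic coefficient instead of a pseudo-random one."""
    return position + step * chaotic_value * (best - position)

# Example: one chaotic step for a krill in a 3-dimensional search space
pos, best = np.zeros(3), np.ones(3)
c = chaotic_sequence(1)[0]
print(move_krill(pos, best, c))
```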

In this paper, an evolutionary undersampling method for balanced and imbalanced data distributions is proposed. The method generates a reduced set composed of those instances that enhance classifier performance, and it reduces the computational time needed to learn a classifier. It also controls the between-class distribution and protects minority class instances.

In the proposed method, instance reduction is viewed as an unconstrained multi-objective optimization problem. Using the chaotic krill herd evolutionary algorithm, both the minority and majority class spaces are explored with accelerated convergence. The krill individuals (instances) are evaluated by a new combined weighted fitness function over contradicting criteria: accuracy, geometric mean (Gmean), and reduction rate. Note that accuracy and Gmean conflict with the reduction rate: in some cases accuracy, and consequently Gmean, improve while the reduction rate drops, or vice versa. Utilizing the designed decision surface, the krill individuals with the best fitness are found. The output of the proposed method is a set purged of those instances that decrease accuracy and Gmean.
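A hedged sketch of such a combined weighted fitness follows, assuming each krill individual is decoded into a boolean retention mask over the training instances and that the three criteria are combined as a weighted sum; the paper's actual decision surfaces (WDDS, CA, CM) and tuned weights are defined in Section 3, so everything below is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, w_acc=0.4, w_gmean=0.4, w_red=0.2):
    """Weighted-sum fitness of one candidate reduced set.

    mask   : boolean vector; True marks a retained instance (one krill individual)
    weights: illustrative values, not the paper's tuned ones
    """
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[mask], y[mask])
    y_pred = clf.predict(X)                 # score against the full training set
    acc = np.mean(y_pred == y)
    tpr = np.mean(y_pred[y == 1] == 1)      # minority-class recall
    tnr = np.mean(y_pred[y == 0] == 0)      # majority-class recall
    red = 1.0 - mask.mean()                 # fraction of instances removed
    return w_acc * acc + w_gmean * np.sqrt(tpr * tnr) + w_red * red
```

Because the reduction term rewards removing instances while the accuracy and Gmean terms penalize removing informative ones, the weighted sum makes the trade-off between the contradicting criteria explicit.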

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 elaborates on the proposed method, while the experimental results using benchmark data sets are presented in Section 4. A discussion of the conducted experiments is presented in Section 5. Finally, conclusions are drawn and future work is outlined in Section 6.

Section snippets

Related work

Instance reduction is a preprocessing step developed to enhance learning tasks, especially for instance-based methods, which need to decide which instances are worth storing for generalization. Instance reduction has rarely been addressed in the context of two-class imbalanced data. Various methods have been developed to remove noisy and redundant instances from underlying balanced and imbalanced data sets. Instance Reduction Algorithm using Hyperrectangle Clustering (IRAHC) (Hamidzadeh et al., 2015) …

Proposed method

The proposed method is designed to select a significant subset of instances from both two-class balanced data and two-class imbalanced data. Taking into account the high storage requirement, computational cost, and sensitivity to noise of instance-based learning methods, a new combined weighted multi-objective optimizer is introduced with the aim of obtaining the training set for a learning algorithm, e.g., kNN or SVM.

In the proposed method, instance reduction is viewed as an unconstrained multi-objective optimization problem. …
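Written out as a sketch only, assuming the standard weighted-sum scalarization over a binary selection vector (the paper's combined weighted optimizer and its WDDS decision surface are specified in the full text), the problem reads:

```latex
% s selects which of the n training instances are retained;
% w_a, w_g, w_r >= 0 are illustrative weights, not the paper's tuned values.
\max_{s \in \{0,1\}^n} \; F(s) = w_a\,\mathrm{Acc}(s) + w_g\,\mathrm{Gmean}(s) + w_r\,\mathrm{Red}(s),
\qquad \mathrm{Red}(s) = 1 - \frac{1}{n}\sum_{i=1}^{n} s_i .
```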

Experimental results

In this section, the experimental results of the proposed method are compared with some state-of-the-art methods. Thorough experiments have been conducted on three scenarios: Scenario 1, described in Section 4.1, contains balanced data set experiments. Scenario 2, described in Section 4.2, contains imbalanced data set experiments. Finally, Scenario 3, elaborated on in Section 4.3, contains synthetic imbalanced data set experiments. Since the focus of the proposed method is on two-class data, …
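The abstract specifies 10-fold cross-validation for these experiments; a minimal sketch of such a harness follows, where the stratification, k = 3 for kNN, and the `reduce_fn` hook (standing in for the proposed reducer or any baseline) are assumptions of this sketch, not details from the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, reduce_fn, n_splits=10, seed=42):
    """Reduce each training fold with reduce_fn, then score a kNN on the held-out fold.

    reduce_fn(X_tr, y_tr) -> boolean mask of retained training instances.
    Returns mean accuracy and mean Gmean over the folds.
    """
    accs, gms = [], []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        mask = reduce_fn(X[tr], y[tr])
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[tr][mask], y[tr][mask])
        y_hat, y_te = clf.predict(X[te]), y[te]
        accs.append(np.mean(y_hat == y_te))
        tpr = np.mean(y_hat[y_te == 1] == 1)
        tnr = np.mean(y_hat[y_te == 0] == 0)
        gms.append(np.sqrt(tpr * tnr))
    return np.mean(accs), np.mean(gms)

# Baseline that keeps everything (no reduction), for comparison:
# evaluate(X, y, lambda X_tr, y_tr: np.ones(len(y_tr), dtype=bool))
```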

Discussion

The comparison of the results in Tables 5 and 10 with Tables 7, 9, 11, and 13 demonstrates the contribution of instance reduction methods to increasing the accuracy and reducing the computational time of instance-based classifiers. As the experimental results on both balanced and imbalanced data sets show, ISCKHAD achieves the highest accuracy/Gmean on most of the data sets. Besides, it provides high reduction rates due to its capability to remove redundant and noisy data.

Conclusions and future work

In the present paper, an improved evolutionary algorithm is designed to remove noisy and redundant data. The proposed method, a combined weighted multi-objective optimizer, is established such that it controls the between-class distribution and protects minority class instances. In this paper, three decision surfaces, WDDS, CA, and CM, are introduced and compared with other methods. The experimental results show that ISCKHAD (the proposed instance reduction method that uses WDDS as its decision surface) …

CRediT authorship contribution statement

Javad Hamidzadeh: Conceptualization, Methodology, Validation, Investigation. Niloufar Kashefi: Software, Writing - original draft, Resources. Mona Moradi: Software, Data curation, Writing - review & editing.

References (65)

  • Hamidzadeh, J., et al., 2015. IRAHC: Instance reduction algorithm using hyperrectangle clustering. Pattern Recognit.
  • de Haro-García, A., et al., 2019. Instance selection based on boosting for instance-based learners. Pattern Recognit.
  • He, Y.-Y., et al., 2009. Comparison of different chaotic maps in particle swarm optimization algorithm for long-term cascaded hydroelectric system scheduling. Chaos Solitons Fractals.
  • Jensi, R., et al., 2016. An improved krill herd algorithm with global exploration capability for solving numerical function optimization problems and its application to data clustering. Appl. Soft Comput.
  • Li, J., et al., 2018. Adaptive multi-objective swarm fusion for imbalanced data classification. Inf. Fusion.
  • Lin, W.-C., et al., 2017. Clustering-based undersampling in class-imbalanced data. Inform. Sci.
  • Liu, C., et al., 2017. An efficient instance selection algorithm to reconstruct training set for support vector machine. Knowl.-Based Syst.
  • Luan, C., et al., 2018. Experimental identification of hard data sets for classification and feature selection methods with insights on method selection. Data Knowl. Eng.
  • Mukherjee, A., et al., 2016. Chaos embedded krill herd algorithm for optimal VAR dispatch problem of power system. Int. J. Electr. Power Energy Syst.
  • Niu, P., et al., 2017. Model turbine heat rate by fast learning network with tuning based on ameliorated krill herd algorithm. Knowl.-Based Syst.
  • Sadhu, A.K., et al., 2016. A modified imperialist competitive algorithm for multi-robot stick-carrying application. Robot. Auton. Syst.
  • Shakiba, A., et al., 2016. Data volume reduction in covering approximation spaces with respect to twenty-two types of covering based rough sets. Internat. J. Approx. Reason.
  • Sun, Y., et al., 2007. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit.
  • Tsai, C.-F., et al., 2019. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inform. Sci.
  • Vluymans, S., et al., 2016. EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data. Neurocomputing.
  • Yang, X., et al., 2019. Pseudo-label neighborhood rough set: measures and attribute reductions. Internat. J. Approx. Reason.
  • Blake, C.L. UCI Repository of Machine Learning Databases.
  • Carbonera, J.L. An efficient approach for instance selection.
  • Carbonera, J.L., Abel, M., 2015. A density-based approach for instance selection. In: 2015 IEEE 27th International...
  • Carbonera, J.L., et al. A novel density-based approach for instance selection.
  • Chang, C.-C., et al., 2011. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol.
  • Chawla, N.V., et al. SMOTEBoost: Improving prediction of the minority class in boosting.


No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103500.
