Combining multi-label classifiers based on projections of the output space using Evolutionary algorithms

https://doi.org/10.1016/j.knosys.2020.105770

Abstract

The multi-label classification task has gained a lot of attention in the last decade owing to its applicability to many real-world problems where each object may be attached to several labels simultaneously. Several ensemble-based approaches for multi-label classification have been proposed in the literature; however, the vast majority randomly select the aspects that make the ensemble diverse and do not consider the characteristics of the data to build it. In this paper we propose an evolutionary method, called Evolutionary AlGorithm for multi-Label Ensemble opTimization (EAGLET), for the selection of simple, accurate and diverse multi-label classifiers to build an ensemble considering the characteristics of the data, such as the relationships among labels and the imbalance degree of the labels. In order to model the relationships among labels, each classifier of the ensemble focuses on a small subset of the label space, resulting in models with relatively low computational complexity and lower imbalance in the output space. The resulting ensemble is generated incrementally from the population of multi-label classifiers, so the member that best fits the ensemble generated so far, considering both predictive performance and diversity, is selected at each step. An experimental study comparing EAGLET with state-of-the-art methods in multi-label classification, over a set of sixteen datasets and five evaluation measures, demonstrated that EAGLET significantly outperformed standard MLC methods and obtained better and more consistent results than state-of-the-art multi-label ensembles.

Introduction

A large number of classification problems can be represented as a Multi-Label Classification (MLC) problem, where each instance may have several labels associated with it simultaneously [1]. For example, in multimedia annotation or text categorization problems, each item could be categorized using several labels or groups simultaneously. Many real-world problems have been successfully solved using this framework, such as protein classification [2], decision support systems for medical diagnosis [3], and image retrieval [4]. Having more than one label associated with each instance poses new classification challenges that need to be addressed, such as modeling the compound relationships among labels and dealing with the imbalance of the output space. Although imbalance is a difficulty that exists in many other problems and has been widely studied in the literature [5], [6], [7], the problem of dealing with the dependencies among many output labels emerged with MLC. Some studies have already demonstrated that addressing these challenges and taking into account the main characteristics of multi-labeled data improves predictive results [8], [9], [10].

Ensemble-based approaches have been studied and successfully used in many areas of data mining. Ensembles of classifiers are based on the combination of several base classifiers to improve the overall predictive performance; some studies have shown that ensembles outperform single classifiers [11]. Similarly, Ensembles of Multi-Label Classifiers (EMLCs) aim to improve the prediction of simple multi-label classifiers by joining predictions of several multi-label base classifiers. Several approaches were proposed in the literature to build EMLCs, mostly based on building models over different subsets of the training data [12], using different feature spaces [13], or using different subsets of the output space [9]. A thorough description of EMLCs can be found in [14] and [15].

Although EMLCs tend to perform better than single classifiers, the selection of the ensemble members is a non-trivial key point [16]. Using accurate classifiers in the ensemble is obviously necessary. However, an ensemble whose base classifiers are very similar to each other, even if accurate, may not only perform below expectations but can even perform worse than individual classifiers. An ensemble containing diverse base classifiers should lead to better accuracy thanks to the diversity of their outputs, although a formal proof of this dependency does not exist [17], [18]. Therefore, the selection of base classifiers is key to generating the ensemble. On the other hand, multi-label scenarios usually involve a high-dimensional output space, so the problems can be intractable with certain algorithmic approaches [19]. As a consequence, selecting a suitable technique to solve the problem is another key point to take into account.

Evolutionary algorithms (EAs) are biology-inspired search algorithms that have been successfully used in different fields of data mining [20], [21], [22], [23]. EAs have also been used in multi-label learning tasks such as the optimization of base multi-label classifiers [24], and the generation of ensembles for both classification and regression problems [25], [26]. EAs not only provide a valuable framework for obtaining an optimal structure for the EMLC, but also allow the characteristics of the data to be considered when building it. Our proposed method, called Evolutionary AlGorithm for multi-Label Ensemble opTimization and hereafter referred to as EAGLET, is able to take advantage of the useful cues that the characteristics of the data provide. EAGLET focuses on building an EMLC by selecting simple, accurate and diverse classifiers. Each base classifier focuses on a small subset of labels, considering the relationships among labels and being able to model the compound dependencies among them at a relatively low computational cost. Modeling subsets of labels implies low imbalance in each of the multi-label classifiers, not only making the learning phase easier, but also improving the predictive performance of each model. The imbalance of the data is considered when selecting the members of the ensemble; therefore, EAGLET selects accurate but also diverse classifiers in such a way that individuals predicting labels that appear infrequently in the ensemble are more likely to be selected. In this way, EAGLET ensures that all labels are included in the ensemble, while not neglecting infrequent labels.
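The incremental selection idea described above can be sketched as follows: among candidate members, prefer accurate classifiers whose label subsets are under-represented in the ensemble built so far. The function names, the rarity bonus, and the weighting parameter `beta` are assumptions made for illustration; they are not EAGLET's exact fitness or update rule.

```python
import numpy as np

def select_next_member(candidates, ensemble_label_counts, beta=0.5):
    """Pick the candidate that best trades off accuracy and label coverage.

    candidates: list of (fitness, label_subset) pairs, fitness in [0, 1].
    ensemble_label_counts: dict mapping label -> times it already appears
    in the ensemble. Labels seen less often receive a higher rarity bonus.
    """
    def score(cand):
        fitness, labels = cand
        # Rarity bonus: 1 for a label never used yet, decaying with usage.
        rarity = np.mean([1.0 / (1 + ensemble_label_counts.get(l, 0))
                          for l in labels])
        return (1 - beta) * fitness + beta * rarity
    return max(candidates, key=score)

counts = {0: 3, 1: 0, 2: 1}                # label 0 is already well covered
cands = [(0.80, [0, 2]),                   # accurate, but frequent labels
         (0.75, [1, 3])]                   # slightly less accurate, rare labels
print(select_next_member(cands, counts))   # → (0.75, [1, 3])
```

With `beta=0.5` the second candidate scores 0.875 versus 0.5875 for the first, so the member covering the rare labels is added, mirroring how EAGLET favors individuals that predict labels infrequently represented in the ensemble.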

The experimental study, carried out over 16 multi-label datasets and using five evaluation measures, demonstrated that EAGLET outperformed a previous method based on evolutionary algorithms for constructing EMLCs [25]. Further, EAGLET outperformed standard and baseline MLC methods, and obtained more consistent performance than state-of-the-art EMLCs, being the only method that did not perform statistically worse than any of the others.

The rest of the article is organized as follows: Section 2 includes background and related work in MLC, Section 3 presents our proposal of evolutionary algorithm for the combination of accurate and diverse multi-label classifiers into an EMLC, Section 4 shows the experimental setup, Section 5 describes and discusses the results, and finally Section 6 ends with conclusions.


Formal definition of MLC

Let D be a multi-label dataset composed of a set of m instances, defined as D = {(x_i, Y_i) | 1 ≤ i ≤ m}. Let X = X_1 × ⋯ × X_d be the d-dimensional input space, and ℒ = {λ_1, λ_2, …, λ_q} the output space composed of q > 1 labels. Each multi-label instance is composed of an input vector x ∈ X and a set of relevant labels Y ⊆ ℒ associated with it. Note that each distinct Y is also called a labelset [1].

The goal of MLC is to construct a predictive model h : X → 2^ℒ which provides a set of relevant labels for an unknown instance. Thus,
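The definitions above map directly onto the usual binary indicator representation of multi-label data, sketched below. The concrete feature values and labelsets are made up for illustration only.

```python
import numpy as np

# m = 3 instances, d = 2 features: each row is an input vector x_i in X.
X = np.array([[0.2, 1.5],
              [0.9, 0.3],
              [0.4, 0.8]])

# q = 4 labels: row i is the indicator encoding of the labelset Y_i,
# i.e. Y[i, j] = 1 iff label λ_j is relevant for instance i.
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]])

# Recover each instance's labelset {λ_j : Y[i, j] = 1} as a set of indices.
labelsets = [set(np.flatnonzero(row).tolist()) for row in Y]
print(labelsets)  # → [{0, 2}, {1}, {0, 1, 3}]
```

A predictive model h : X → 2^ℒ then simply maps a feature vector to one such binary row (equivalently, to one such index set).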

EAGLET

In this section, we introduce EAGLET. First, the structure of the multi-label classifier obtained as solution is presented and analyzed. Then, the main aspects of the evolutionary algorithm are presented, including description of the individuals and their initialization, the genetic operators, the fitness function, and how the ensemble is generated from the individuals.

Experimental studies

In this section, the datasets and evaluation measures used for assessing the algorithms are described, and the experimental settings are explained.

Results and discussion

In this section, a summary of the results of the different experiments defined in the previous section is presented. The supplementary material available at the KDIS Research Group website contains full tables of results, including not only the five evaluation measures presented in the paper, but also many more. First, the effect of the parameters of EAGLET was analyzed and their default values were selected. Then, EAGLET was compared

Conclusions

In this paper we proposed an evolutionary algorithm, EAGLET, focused on the creation of an EMLC where each member is a multi-label classifier able to predict a subset of k labels. EAGLET considers characteristics of the data, such as the imbalance of the dataset, when building the ensemble, and models the relationships among labels in the prediction phase by considering small subsets of k labels.

EAGLET evolves a population of multi-label classifiers, each built over a subset of labels. Then, it

CRediT authorship contribution statement

Jose M. Moyano: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Eva L. Gibaja: Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Writing - review & editing. Krzysztof J. Cios: Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Writing - review & editing. Sebastián Ventura: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund, project TIN2017-83445-P. This research was also supported by the Spanish Ministry of Education under FPU Grant FPU15/02948.

References (55)

  • S.J. Nanda et al., A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm Evol. Comput. (2014)
  • H. Faris et al., An efficient binary salp swarm algorithm with crossover scheme for feature selection problems, Knowl.-Based Syst. (2018)
  • M. Taradeh et al., An evolutionary gravitational search-based feature selection, Inform. Sci. (2019)
  • J.M. Moyano et al., An evolutionary approach to build ensembles of multi-label classifiers, Inf. Fusion (2019)
  • M. Boutell et al., Learning multi-label scene classification, Pattern Recognit. (2004)
  • N. Zhang et al., Multi layer ELM-RBF for multi-label learning, Appl. Soft Comput. (2016)
  • C. Lin et al., LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy, Neurocomputing (2014)
  • J.M. Moyano et al., MLDA: A tool for analyzing multi-label datasets, Knowl.-Based Syst. (2017)
  • R.B. Pereira et al., Correlation analysis of performance measures for multi-label classification, Inf. Process. Manage. (2018)
  • L. Rokach et al., Ensemble methods for multi-label classification, Expert Syst. Appl. (2014)
  • E. Gibaja et al., Multi-label learning: A review of the state of the art and ongoing research, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. (2014)
  • A.C. Tan et al., Multi-class protein fold classification using a new ensemble machine learning approach, Genome Inf. (2003)
  • H.-J. Lin et al., Content-based image retrieval trained by adaboost for mobile application, Int. J. Pattern Recognit. Artif. Intell. (2006)
  • J. Read et al., Classifier chains for multi-label classification, Mach. Learn. (2011)
  • G. Tsoumakas et al., Random k-labelsets for multi-label classification, IEEE Trans. Knowl. Data Eng. (2011)
  • Á. Arnaiz-González et al., Local sets for multi-label instance selection, Appl. Soft Comput. (2018)
  • G. Tsoumakas et al., A taxonomy and short review of ensemble selection
