Information Sciences

Volume 589, April 2022, Pages 636-654

Recursive elimination current algorithms and a distributed computing scheme to accelerate wrapper feature selection

https://doi.org/10.1016/j.ins.2021.12.086

Highlights

  • A new reflection on the relationship between feature selection problems and search algorithms within the limitations of the NFL theorems.

  • A novel set of algorithms for wrapper feature selection based on the recursion technique.

  • An asynchronous distributed computing scheme to accelerate the processing of wrapper feature selection.

  • Non-parametric and adaptive search algorithms that handle wrapper feature selection problems with high efficiency and scalability.

Abstract

Feature selection (FS) for classification tasks in machine learning and data mining has attracted significant attention. Recently, a growing number of metaheuristic optimization algorithms have been applied to wrapper FS problems. Nevertheless, algorithms that improve existing optimizers inevitably bring higher complexity and require greater computational cost in wrapper mode. In this work, we present recursive elimination current (REC) algorithms, a novel set of algorithms for wrapper FS that consists of the simplest feature subset representation, a structure inspired by the recursion technique in computer science, and the necessary supporting components. To some extent, the proposed algorithms, recurrent REC (REC2) and distributed REC (DiREC), address issues often discussed in metaheuristics-based FS research, including but not limited to maintaining population diversity and scaling to high dimensionality. In particular, DiREC is a distributed computing scheme that accelerates the FS process by distributing tasks to different computing units in a cluster of computers. A series of experiments was carried out on several representative benchmark datasets, such as Madelon from the UCI machine learning repository, using REC2 and DiREC with various numbers of logical processors. The results demonstrate that the proposed algorithms perform efficiently in wrapper mode and that the distributed computing scheme is effective, yielding computational time savings.

Introduction

Feature selection (FS) is an important preprocessing technique in machine learning and data mining that aims for higher classification accuracy, better generalization ability and simpler learning models. However, removing irrelevant and redundant features from the full set of available features is challenging due to the highly complicated interactions between feature values and class labels. The wrapper [1], one of the most popular feature selection techniques, incorporates a supervised learning algorithm as a black box to evaluate candidate feature subsets in terms of their classification performance (or predictive power), and then guides the search process toward the subset with the best performance. Compared to the other two techniques, i.e., the filter and embedded methods, the wrapper is designed to decouple the search algorithm from the intrinsic statistical characteristics of the dataset, thereby placing more emphasis on search ability. Provided that the same learning algorithm is used as in the FS stage, the subsequent prediction task using the features selected by the wrapper often achieves better performance [2]. Therefore, this paper focuses on wrapper FS algorithms.
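
For concreteness, a minimal sketch of this black-box evaluation step is given below in Python; the k-nearest-neighbors classifier and 5-fold cross-validation are illustrative assumptions, not the configuration used in this paper.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def wrapper_score(X, y, subset, cv=5):
    """Wrapper evaluation of one candidate subset: train/validate a classifier
    (treated as a black box) on the selected columns only and return the mean
    cross-validation accuracy."""
    return cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=cv).mean()
```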

A typical wrapper FS process needs two indispensable components: a search algorithm to generate diverse feature subsets and an evaluator to measure them. Appropriate evaluation criteria can lead the search toward better feature subsets. In general, the primary objective of wrapper FS is to maximize the generalization ability of the final classifier, i.e., its prediction ability on unseen instances. In practice, prediction ability must be estimated from a limited number of known instances; it is frequently scored by invoking machine learning algorithms on datasets prepared with an independent training/validation split or n-fold cross-validation [2]. Wrapper FS processes are therefore computationally costly, and the cost of evaluating a single feature subset is positively correlated with the number of selected features. Reducing the number of features is thus deemed another objective, which is self-evident from the term "feature selection". Plenty of research simultaneously considering these two objectives has been presented in the literature. The two most common objective formulations at present are the aggregated weighted-sum form and the Pareto-optimal multi-objective form. To be clear, these two objectives are not always in conflict. Owing to the curse of dimensionality, given a limited number of available instances, classification performance is degraded by superfluous features when too many features are used; on the other hand, when too few features are used, performance declines because informative features are neglected. Hence, reasonable feature subset evaluation is conducive to identifying the optimal subset.
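
For illustration, the aggregated weighted-sum formulation can be sketched as follows; the weight alpha (here 0.99) and the size-penalty term are assumptions for the example, not the criterion adopted later in this paper.

```python
def weighted_sum_fitness(accuracy, n_selected, n_total, alpha=0.99):
    """Aggregate the two FS objectives into one score (higher is better):
    alpha weights classification accuracy, (1 - alpha) rewards smaller subsets."""
    return alpha * accuracy + (1.0 - alpha) * (1.0 - n_selected / n_total)


# Example: at equal accuracy, the smaller subset scores higher.
print(weighted_sum_fitness(0.88, 20, 500))   # ~0.8808
print(weighted_sum_fitness(0.88, 480, 500))  # ~0.8716
```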

A search algorithm for FS is an ensemble of effective operators, properly organized to generate promising feature subsets. In the search process of wrapper FS, new subsets are dynamically generated and then evaluated, and this cycle is repeated. Traditional FS algorithms include sequential ones such as sequential forward selection (SFS) [3], sequential backward selection (SBS) [4] and their improved versions, sequential forward/backward floating selection (SFFS/SBFS) [5], improved forward floating selection (IFFS) [6], etc. Practice has shown that these algorithms suffer from low search efficiency, stagnation in local optima, and poor scalability to high-dimensional datasets. In the past three decades, with the prosperity of stochastic optimization algorithms, especially population-based metaheuristics known for their global search ability, there has been an explosion of research on FS. Since the first FS study based on the genetic algorithm (GA) [7] was published in 1989, numerous metaheuristics have been applied to FS as the search algorithm, such as differential evolution (DE), particle swarm optimization (PSO), the whale optimization algorithm (WOA), etc. To date, plenty of research has tried to apply ever newer metaheuristics and various improvements to FS problems. Several studies [8], [9], [10], [11] have comprehensively surveyed these fruitful achievements. Nevertheless, drawbacks remain when carrying out research based on existing metaheuristic mechanisms.
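
For reference, the greedy wrapper loop behind SFS can be sketched as below; scikit-learn, a KNN classifier and 5-fold cross-validation are assumed purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def sequential_forward_selection(X, y, max_features=None, cv=5):
    """Greedy SFS: repeatedly add the single feature that most improves CV accuracy."""
    n_features = X.shape[1]
    max_features = n_features if max_features is None else max_features
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        # Score every not-yet-selected feature when added to the current subset.
        scores = {
            j: cross_val_score(KNeighborsClassifier(), X[:, selected + [j]], y, cv=cv).mean()
            for j in range(n_features) if j not in selected
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # stop once no addition improves the score
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score
```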

Introducing population-based metaheuristic algorithms to FS problems always requires proper adaptations. These algorithms were originally invented for real-valued parameter optimization, whereas FS is a binary-valued combinatorial optimization problem. The adaptations rely on transfer functions that transform the original real-valued vectors into binary ones for feature subset representation. However, even though many studies have used transfer functions, such as the S/V-shaped transfers in [12] and the simple threshold θ transfer in [13], the optimality of the mapped solutions cannot be theoretically guaranteed. Counter-intuitively, metaheuristic algorithms even conflict with the nature of FS problems: their populations are designed to converge to local or global optima, whereas the FS process needs to visit as many different feature subsets as possible to support the superiority of the (quasi-/near-) optimal solution. Repeatedly generating the same solutions is pointless; even worse, it wastes resources on redundant evaluations. Therefore, exploration and exploitation are frequently enhanced in metaheuristic FS research such as [14], [15], where improved operators for both global and local search are applied to diversify the population and thus visit more distinct feature subsets.
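
For illustration, the commonly used S-shaped (sigmoid) transfer with stochastic thresholding can be sketched as follows; this is the generic rule, not necessarily the exact variant of [12] or [13].

```python
import numpy as np


def s_shaped_transfer(position, rng=None):
    """Map a real-valued position vector to a 0-1 feature mask: each feature j
    is selected with probability sigmoid(position[j]). A fixed threshold
    (e.g. theta = 0.5 on the probabilities) is the simpler deterministic variant."""
    rng = np.random.default_rng() if rng is None else rng
    prob = 1.0 / (1.0 + np.exp(-np.asarray(position)))
    return (rng.random(prob.shape) < prob).astype(int)


# Example: a metaheuristic's real-valued particle becomes a binary subset mask.
print(s_shaped_transfer([2.1, -0.3, 0.0, -4.0, 1.5]))  # e.g. [1 0 1 0 1]
```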

Another defect of directly using metaheuristic algorithms for FS lies in the poor interpretability of the search process. The emergence of optimal subsets, driven by stochastic operators, is incidental: better feature subsets are generated by chance, not by design. The "generation chain" of the optimal subsets is obscure, so it cannot explain why one subset leads to a better next one. Finally, although metaheuristic algorithms are considered to have simple mechanisms and structures, practitioners still face substantial learning, programming and testing costs. For example, the hyperparameters introduced by these algorithms are difficult to recognize and to set to reasonable values for all cases. In most related research, parameter tuning is an important step that must be addressed in advance, by experience, before the formal experiments [16]. In fact, no single optimal setting exists for an algorithm across datasets of diverse dimensionality. The foregoing arguments lead to an in-depth question: can a better FS search algorithm always be achieved by tirelessly improving metaheuristic algorithms?

To overcome the aforementioned drawbacks, this paper introduces a brand-new FS algorithm design named recursive elimination current (REC), which is inspired by the recursion technique in computer science and guided by the implications of the No Free Lunch (NFL) theorems [17]. Based on REC, two complete algorithms are established. The first, recurrent REC (REC2), is a promising alternative to existing feature subset search algorithms and is simpler than the metaheuristic ones. The second, distributed REC (DiREC), is a distributed computing scheme built on the former that accelerates the wrapper FS process by distributing the computational burden to different processors in a cluster of computers connected through a local area network (LAN). Our algorithms are substantially different from those developed on the basis of metaheuristic optimizers. Using the two algorithms, experiments were carried out on several benchmark datasets that are often investigated in FS research. The experimental results are partly compared with those of several published studies, and our analysis demonstrates that the proposed algorithms are effective. The prominent characteristics of REC2 and DiREC include their non-parametric nature, good self-adaptation and scalability to high dimensionality. In particular, DiREC can reduce the processing time remarkably, although this does not mean that more processors always yield a shorter processing time.
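
DiREC itself is specified in Section 3; the sketch below only illustrates the general idea of dispatching independent wrapper evaluations to multiple worker processes and collecting them asynchronously, using Python's standard process pool and a synthetic dataset, and should not be read as the DiREC protocol.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a benchmark dataset in this illustration.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)


def evaluate(subset):
    """Wrapper evaluation of one candidate subset (0-based column indices)."""
    acc = cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=5).mean()
    return subset, acc


candidates = [tuple(range(k)) for k in range(5, 30, 5)]   # toy candidate subsets

# Evaluations are independent, so they can be farmed out to worker processes
# and gathered as they finish, which is the kind of speed-up DiREC targets.
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(evaluate, s) for s in candidates]
        best = max((f.result() for f in as_completed(futures)), key=lambda r: r[1])
    print("best subset:", best[0], "accuracy: %.3f" % best[1])
```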

Specifically, the main contributions of this work are summarized as follows:

  1. A retrospection of the objective formulations for wrapper FS is performed, which suggests that the number of features could be manipulated elaborately rather than just treated as part of the objectives;

  2. A new reflection on the relationship between FS and search algorithms is driven within the limitations of the NFL theorems to guide our novel design for FS algorithms;

  3. A simple feature elimination structure design (recursive elimination current, REC) is conceived to execute a quick search for potentially better subsets, which further constitutes the REC2 algorithm at a higher level;

  4. The novel research on an asynchronous distributed computing scheme (DiREC) for wrapper FS is presented and implemented in practical terms.

The remaining sections of this paper are organized as follows. Section 2 outlines and analyzes the background and related work in terms of the objectives, search algorithms and computing modes of the FS process. Section 3 then presents our methodology for the novel feature selection algorithms. Section 4 introduces the experimental design and the means of comparison. Section 5 discusses and compares the experimental results. Finally, Section 6 draws conclusions and points out our future work.

Section snippets

Objective formulations of wrapper feature selection

Wrapper FS evaluates candidate feature subsets based on the performance of the classifier models trained on the datasets restricted by these feature subsets, so as to determine the possible optimal subset. Generally, classification accuracy or error rate is utilized as the main factor, followed by the size of feature subsets. In early research, some algorithms sequentially add/remove one to several features [6] or randomly choose a number of features [18] to produce different feature subsets.

Feature subset representation and criterion

In this work, we use index sets to represent the subsets of selected features. Considering a dataset with n features, the full set of features is represented as S_A = {1, 2, …, n}. A feature subset S_i of this dataset can then be defined as S_i = {j, k, …, m} ⊆ S_A, where j, k, …, m denote that the j-th, k-th, …, m-th features are selected. This representation form can be found in traditional FS methods, such as SFS, SBS, etc. It is more explicit than that of the popular metaheuristic methods which widely use the 0–1
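
As a small illustration of the index-set representation and its relation to the 0–1 mask widely used by metaheuristic methods (the concrete numbers are arbitrary):

```python
import numpy as np

# Illustrative only: a 10-feature dataset in which features 2, 5 and 7 are selected.
n = 10
S_A = set(range(1, n + 1))            # full feature set S_A = {1, 2, ..., n}
S_i = {2, 5, 7}                       # index-set representation used in this paper
assert S_i <= S_A                     # S_i is a subset of S_A

# Equivalent 0-1 mask used by many metaheuristic FS methods (1 = selected).
mask = np.zeros(n, dtype=int)
mask[[j - 1 for j in S_i]] = 1        # shift to 0-based column indices
columns = np.flatnonzero(mask) + 1    # recover the 1-based indices: [2, 5, 7]
```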

Dataset description and subset evaluation

In past FS research, datasets from the UCI machine learning repository and other well-known sources, such as the ASU feature selection datasets, were commonly chosen for experiments with proposed algorithms [14], [23], [40], [15]. However, before giving our experimental scheme, we suggest that readers reconsider the already published FS results by focusing on a representative dataset, Madelon, to draw

Overall search ability

The main experimental results of this study are given in Table 3 and Table 4, which are based on the records from Algorithm 2 over different runs. These two tables cover four aspects of the search ability of FS algorithms: classification accuracy, the number of selected features, the visited feature subsets and the run time.

Conclusion and future work

Wrapper FS requires a high-performance search algorithm to find the optimal feature subset. Although a large number of FS algorithms have been proposed with the help of population-based metaheuristic optimization algorithms, there is still no satisfactory answer to the question of why the improved algorithms are efficient for solving FS problems. This paper provides a new way to solve the wrapper FS problems. First, the objective formulations are analyzed to make the point that the number of

CRediT authorship contribution statement

Wei Liu: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Jianyu Wang: Data curation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (47)

  • D. Rodrigues et al., A multi-objective artificial butterfly optimization approach for feature selection, Appl. Soft Comput. (2020)

  • A. Deniz et al., Robust multiobjective evolutionary feature subset selection algorithm for binary classification using machine learning techniques, Neurocomputing (2017)

  • Y. Zhang et al., Binary differential evolution with self-learning for multi-objective feature selection, Inf. Sci. (2020)

  • Y. Zhou et al., Many-objective optimization of feature selection based on two-level particle cooperation, Inf. Sci. (2020)

  • Y. Zhou et al., A problem-specific non-dominated sorting genetic algorithm for supervised feature selection, Inf. Sci. (2021)

  • D. Albashish et al., Binary biogeography-based optimization based SVM-RFE for feature selection, Appl. Soft Comput. (2021)

  • X. Zhang et al., Gaussian mutational chaotic fruit fly-built optimization and feature selection, Expert Syst. Appl. (2020)

  • A. Deniz et al., On initial population generation in feature subset selection, Expert Syst. Appl. (2019)

  • X. Song et al., Feature selection using bare-bones particle swarm optimization with mutual information, Pattern Recognit. (2021)

  • C. Huang et al., A distributed PSO-SVM hybrid system with feature selection and parameter optimization, Appl. Soft Comput. (2008)

  • V. Bolón-Canedo et al., Distributed feature selection: An application to microarray data classification, Appl. Soft Comput. (2015)

  • L. Moran-Fernandez et al., Centralized vs. distributed feature selection methods based on data complexity measures, Knowl. Based Syst. (2017)

  • W. Pedrycz et al., Evolutionary feature selection via structure retention, Expert Syst. Appl. (2012)