Information Sciences

Volume 589, April 2022, Pages 636-654

Recursive elimination current algorithms and a distributed computing scheme to accelerate wrapper feature selection

https://doi.org/10.1016/j.ins.2021.12.086

Highlights

  • A new reflection on the relationship between feature selection problems and search algorithms within the limitations of the NFL theorems.

  • A novel set of algorithms for wrapper feature selection based on the recursion technique.

  • An asynchronous distributed computing scheme to accelerate the processing of wrapper feature selection.

  • Non-parametric and adaptive search algorithms that handle wrapper feature selection problems with high efficiency and scalability.

Abstract

Feature selection (FS) for classification tasks in machine learning and data mining has attracted significant attention. Recently, a growing number of metaheuristic optimization algorithms have been applied to wrapper FS problems. Nevertheless, algorithms that improve existing optimizers inevitably bring higher complexity and require greater computational cost in wrapper mode. In this work, we present recursive elimination current (REC) algorithms, a novel set of algorithms for wrapper FS that consists of the simplest feature subset representation, a structure inspired by the recursion technique in computer science, and the necessary supporting components. To some extent, the proposed algorithms, recurrent REC (REC2) and distributed REC (DiREC), address issues often discussed in metaheuristics-based FS research, including but not limited to maintaining population diversity and scaling to high dimensionality. In particular, DiREC is a distributed computing scheme that accelerates the FS process by distributing tasks to different computing units in a cluster of computers. A series of experiments was carried out on several representative benchmark datasets, such as Madelon from the UCI machine learning repository, using REC2 and DiREC with various numbers of logical processors. The results demonstrate that the proposed algorithms perform efficiently in wrapper mode and that the distributed computing scheme is effective, yielding computational time savings.

Introduction

Feature selection (FS) is an important preprocessing technique in machine learning and data mining that aims for higher classification accuracy, better generalization ability and simpler learning models. However, removing irrelevant and redundant features from the full set of available features is challenging due to the highly complicated interactions between feature values and class labels. The wrapper [1], one of the most popular feature selection techniques, incorporates a supervised learning algorithm as a black box to evaluate candidate feature subsets in terms of their classification performance (or predictive power), and then guides the search process toward the subset with the best performance. Compared to the other two techniques, i.e., the filter and embedded methods, the wrapper is designed to decouple the search algorithm from the intrinsic statistical characteristics of the dataset, thereby placing more emphasis on search ability. Provided that the same learning algorithm is used as in the FS stage, the subsequent prediction task using the features selected by the wrapper often achieves better performance [2]. Therefore, this paper focuses on wrapper FS algorithms.
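
For concreteness, a minimal sketch of this black-box evaluation step is given below in Python; the k-nearest-neighbors classifier and 5-fold cross-validation are illustrative assumptions, not the configuration used in this paper.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def wrapper_score(X, y, subset, cv=5):
    """Wrapper evaluation of one candidate subset: train/validate a classifier
    (treated as a black box) on the selected columns only and return the mean
    cross-validation accuracy."""
    return cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=cv).mean()
```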

A typical wrapper FS process needs two indispensable components: a search algorithm to generate diverse feature subsets and an evaluator to measure them. Appropriate evaluation criteria can lead the search toward better feature subsets. In general, the primary objective of wrapper FS is to maximize the generalization ability of the final classifier, i.e., its prediction ability on unseen instances. In practice, prediction ability must be estimated from a limited number of known instances; it is frequently scored by invoking machine learning algorithms on datasets prepared with an independent training/validation split or n-fold cross-validation [2]. Wrapper FS processes are therefore computationally costly, and the cost of evaluating a single feature subset is positively correlated with the number of selected features. Reducing the number of features is thus deemed another objective, which is self-evident from the term "feature selection". Plenty of research simultaneously considering these two objectives has been presented in the literature. The two most common objective formulations at present are the aggregated weighted-sum form and the Pareto-optimal multi-objective form. To be clear, these two objectives are not always in conflict. Owing to the curse of dimensionality, given a limited number of available instances, classification performance is degraded by superfluous features when too many features are used; on the other hand, when too few features are used, performance declines because informative features are neglected. Hence, reasonable feature subset evaluation is conducive to identifying the optimal subset.
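
For illustration, the aggregated weighted-sum formulation can be sketched as follows; the weight alpha (here 0.99) and the size-penalty term are assumptions for the example, not the criterion adopted later in this paper.

```python
def weighted_sum_fitness(accuracy, n_selected, n_total, alpha=0.99):
    """Aggregate the two FS objectives into one score (higher is better):
    alpha weights classification accuracy, (1 - alpha) rewards smaller subsets."""
    return alpha * accuracy + (1.0 - alpha) * (1.0 - n_selected / n_total)


# Example: at equal accuracy, the smaller subset scores higher.
print(weighted_sum_fitness(0.88, 20, 500))   # ~0.8808
print(weighted_sum_fitness(0.88, 480, 500))  # ~0.8716
```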

A search algorithm for FS is an ensemble of effective operators, properly organized to generate promising feature subsets. In the search process of wrapper FS, new subsets are dynamically generated and then evaluated, and this cycle is repeated. Traditional FS algorithms include sequential ones such as sequential forward selection (SFS) [3], sequential backward selection (SBS) [4] and their improved versions, sequential forward/backward floating selection (SFFS/SBFS) [5], improved forward floating selection (IFFS) [6], etc. Practice has shown that these algorithms suffer from low search efficiency, stagnation in local optima, and poor scalability to high-dimensional datasets. In the past three decades, with the prosperity of stochastic optimization algorithms, especially population-based metaheuristics known for their global search ability, there has been an explosion of research on FS. Since the first FS study based on the genetic algorithm (GA) [7] was published in 1989, numerous metaheuristics have been applied to FS as the search algorithm, such as differential evolution (DE), particle swarm optimization (PSO), the whale optimization algorithm (WOA), etc. To date, plenty of research has tried to apply ever newer metaheuristics and various improvements to FS problems. Several studies [8], [9], [10], [11] have comprehensively surveyed these fruitful achievements. Nevertheless, drawbacks remain when carrying out research based on existing metaheuristic mechanisms.
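
For reference, the greedy wrapper loop behind SFS can be sketched as below; scikit-learn, a KNN classifier and 5-fold cross-validation are assumed purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def sequential_forward_selection(X, y, max_features=None, cv=5):
    """Greedy SFS: repeatedly add the single feature that most improves CV accuracy."""
    n_features = X.shape[1]
    max_features = n_features if max_features is None else max_features
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        # Score every not-yet-selected feature when added to the current subset.
        scores = {
            j: cross_val_score(KNeighborsClassifier(), X[:, selected + [j]], y, cv=cv).mean()
            for j in range(n_features) if j not in selected
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # stop once no addition improves the score
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score
```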

Introducing population-based metaheuristic algorithms to FS problems always requires proper adaptations. These algorithms were originally invented for real-valued parameter optimization, whereas FS is a binary-valued combinatorial optimization problem. The adaptations rely on transfer functions that transform the original real-valued vectors into binary ones for feature subset representation. However, even though many studies have used transfer functions, such as the S/V-shaped transfers in [12] and the simple threshold θ transfer in [13], the optimality of the mapped solutions cannot be theoretically guaranteed. Counter-intuitively, metaheuristic algorithms even conflict with the nature of FS problems: their populations are designed to converge to local or global optima, whereas the FS process needs to visit as many different feature subsets as possible to support the superiority of the (quasi-/near-) optimal solution. Repeatedly generating the same solutions is pointless; even worse, it wastes resources on redundant evaluations. Therefore, exploration and exploitation are frequently enhanced in metaheuristic FS research such as [14], [15], where improved operators for both global and local search are applied to diversify the population and thus visit more distinct feature subsets.
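
For illustration, the commonly used S-shaped (sigmoid) transfer with stochastic thresholding can be sketched as follows; this is the generic rule, not necessarily the exact variant of [12] or [13].

```python
import numpy as np


def s_shaped_transfer(position, rng=None):
    """Map a real-valued position vector to a 0-1 feature mask: each feature j
    is selected with probability sigmoid(position[j]). A fixed threshold
    (e.g. theta = 0.5 on the probabilities) is the simpler deterministic variant."""
    rng = np.random.default_rng() if rng is None else rng
    prob = 1.0 / (1.0 + np.exp(-np.asarray(position)))
    return (rng.random(prob.shape) < prob).astype(int)


# Example: a metaheuristic's real-valued particle becomes a binary subset mask.
print(s_shaped_transfer([2.1, -0.3, 0.0, -4.0, 1.5]))  # e.g. [1 0 1 0 1]
```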

Another defect of directly using metaheuristic algorithms for FS lies in the poor interpretability of the search process. The emergence of optimal subsets, driven by stochastic operators, is incidental: better feature subsets are generated by chance, not by design. The "generation chain" of the optimal subsets is obscure, so it cannot explain why one subset leads to a better next one. Finally, although metaheuristic algorithms are considered to have simple mechanisms and structures, practitioners still face substantial learning, programming and testing costs. For example, the hyperparameters introduced by these algorithms are difficult to recognize and to set to reasonable values for all cases. In most related research, parameter tuning is an important step that must be addressed in advance, by experience, before the formal experiments [16]. In fact, no single optimal setting exists for an algorithm across datasets of diverse dimensionality. The foregoing arguments lead to an in-depth question: can a better FS search algorithm always be achieved by tirelessly improving metaheuristic algorithms?

To overcome the aforementioned drawbacks, this paper introduces a brand-new FS algorithm design named recursive elimination current (REC), which is inspired by the recursion technique in computer science and guided by the implications of the No Free Lunch (NFL) theorems [17]. Based on REC, two complete algorithms are established. The first, recurrent REC (REC2), is a promising alternative to existing feature subset search algorithms and is simpler than the metaheuristic ones. The second, distributed REC (DiREC), is a distributed computing scheme built on the former that accelerates the wrapper FS process by distributing the computational burden to different processors in a cluster of computers connected through a local area network (LAN). Our algorithms are substantially different from those developed on the basis of metaheuristic optimizers. Using the two algorithms, experiments were carried out on several benchmark datasets that are often investigated in FS research. The experimental results are partly compared with those of several published studies, and our analysis demonstrates that the proposed algorithms are effective. The prominent characteristics of REC2 and DiREC include their non-parametric nature, good self-adaptation and scalability to high dimensionality. In particular, DiREC can reduce the processing time remarkably, although this does not mean that more processors always yield a shorter processing time.
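
DiREC itself is specified in Section 3; the sketch below only illustrates the general idea of dispatching independent wrapper evaluations to multiple worker processes and collecting them asynchronously, using Python's standard process pool and a synthetic dataset, and should not be read as the DiREC protocol.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a benchmark dataset in this illustration.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)


def evaluate(subset):
    """Wrapper evaluation of one candidate subset (0-based column indices)."""
    acc = cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=5).mean()
    return subset, acc


candidates = [tuple(range(k)) for k in range(5, 30, 5)]   # toy candidate subsets

# Evaluations are independent, so they can be farmed out to worker processes
# and gathered as they finish, which is the kind of speed-up DiREC targets.
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(evaluate, s) for s in candidates]
        best = max((f.result() for f in as_completed(futures)), key=lambda r: r[1])
    print("best subset:", best[0], "accuracy: %.3f" % best[1])
```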

Specifically, the main contributions of this work are summarized as follows:

  1. A retrospection of the objective formulations for wrapper FS is performed, which suggests that the number of features could be manipulated elaborately rather than just treated as part of the objectives;

  2. A new reflection on the relationship between FS and search algorithms is driven within the limitations of the NFL theorems to guide our novel design for FS algorithms;

  3. A simple feature elimination structure design (recursive elimination current, REC) is conceived to execute a quick search for potentially better subsets, which further constitutes the REC2 algorithm at a higher level;

  4. The novel research on an asynchronous distributed computing scheme (DiREC) for wrapper FS is presented and implemented in practical terms.

The remaining sections of this paper are organized as follows. Section 2 outlines and analyzes the background and related work in terms of the objectives, search algorithms and computing modes of the FS process. Section 3 then presents our methodology for the novel feature selection algorithms. Section 4 introduces the experimental design and the means of comparison. Section 5 discusses and compares the experimental results. Finally, Section 6 draws conclusions and points out our future work.

Section snippets

Objective formulations of wrapper feature selection

Wrapper FS evaluates candidate feature subsets based on the performance of the classifier models trained on the datasets restricted by these feature subsets, so as to determine the possible optimal subset. Generally, classification accuracy or error rate is utilized as the main factor, followed by the size of feature subsets. In early research, some algorithms sequentially add/remove one to several features [6] or randomly choose a number of features [18] to produce different feature subsets.

Feature subset representation and criterion

In this work, we use index sets to represent the subsets of selected features. Considering a dataset with n features, the full set of features is represented as S_A = {1, 2, …, n}. A feature subset S_i of this dataset can then be defined as S_i = {j, k, …, m} ⊆ S_A, where j, k, …, m denote that the j-th, k-th, …, m-th features are selected. This representation form can be found in traditional FS methods, such as SFS, SBS, etc. It is more explicit than that of the popular metaheuristic methods which widely use the 0–1
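
As a small illustration of the index-set representation and its relation to the 0–1 mask widely used by metaheuristic methods (the concrete numbers are arbitrary):

```python
import numpy as np

# Illustrative only: a 10-feature dataset in which features 2, 5 and 7 are selected.
n = 10
S_A = set(range(1, n + 1))            # full feature set S_A = {1, 2, ..., n}
S_i = {2, 5, 7}                       # index-set representation used in this paper
assert S_i <= S_A                     # S_i is a subset of S_A

# Equivalent 0-1 mask used by many metaheuristic FS methods (1 = selected).
mask = np.zeros(n, dtype=int)
mask[[j - 1 for j in S_i]] = 1        # shift to 0-based column indices
columns = np.flatnonzero(mask) + 1    # recover the 1-based indices: [2, 5, 7]
```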

Dataset description and subset evaluation

In past FS research, datasets from the UCI machine learning repository and other well-known sources, such as the ASU feature selection datasets, were commonly chosen for experiments with proposed algorithms [14], [23], [40], [15]. However, before giving our experimental scheme, we suggest that readers reconsider the already published FS results by focusing on a representative dataset, Madelon, to draw

Overall search ability

The main experimental results of this study are given in Table 3 and Table 4, which are based on the records from Algorithm 2 over different runs. These two tables cover four aspects of the search ability of FS algorithms: classification accuracy, the number of selected features, the visited feature subsets and the run time.

Conclusion and future work

Wrapper FS requires a high-performance search algorithm to find the optimal feature subset. Although a large number of FS algorithms have been proposed with the help of population-based metaheuristic optimization algorithms, there is still no satisfactory answer to the question of why the improved algorithms are efficient for solving FS problems. This paper provides a new way to solve the wrapper FS problems. First, the objective formulations are analyzed to make the point that the number of

CRediT authorship contribution statement

Wei Liu: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Jianyu Wang: Data curation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (47)

  • D. Rodrigues et al., A multi-objective artificial butterfly optimization approach for feature selection, Appl. Soft Comput. (2020)

  • A. Deniz et al., Robust multiobjective evolutionary feature subset selection algorithm for binary classification using machine learning techniques, Neurocomputing (2017)

  • Y. Zhang et al., Binary differential evolution with self-learning for multi-objective feature selection, Inf. Sci. (2020)

  • Y. Zhou et al., Many-objective optimization of feature selection based on two-level particle cooperation, Inf. Sci. (2020)

  • Y. Zhou et al., A problem-specific non-dominated sorting genetic algorithm for supervised feature selection, Inf. Sci. (2021)

  • D. Albashish et al., Binary biogeography-based optimization based SVM-RFE for feature selection, Appl. Soft Comput. (2021)

  • X. Zhang et al., Gaussian mutational chaotic fruit fly-built optimization and feature selection, Expert Syst. Appl. (2020)

  • A. Deniz et al., On initial population generation in feature subset selection, Expert Syst. Appl. (2019)

  • X. Song et al., Feature selection using bare-bones particle swarm optimization with mutual information, Pattern Recognit. (2021)

  • C. Huang et al., A distributed PSO-SVM hybrid system with feature selection and parameter optimization, Appl. Soft Comput. (2008)

  • V. Bolón-Canedo et al., Distributed feature selection: An application to microarray data classification, Appl. Soft Comput. (2015)

  • L. Moran-Fernandez et al., Centralized vs. distributed feature selection methods based on data complexity measures, Knowl. Based Syst. (2017)

  • W. Pedrycz et al., Evolutionary feature selection via structure retention, Expert Syst. Appl. (2012)