Applied Soft Computing

Volume 71, October 2018, Pages 213-225

A recursive PSO scheme for gene selection in microarray data

https://doi.org/10.1016/j.asoc.2018.06.019

Highlights

  • Proposed a recursive PSO method for gene selection.

  • Improved accuracy with a reduced set of features (genes).

  • Faster than the standard PSO and its variants.

Abstract

In DNA microarray datasets, the number of genes is very large, typically in the thousands, while the number of samples is in the hundreds. This raises the issue of generalization in the classification process. Gene selection plays a significant role in improving the accuracy. In this paper, we propose a recursive particle swarm optimization (PSO) approach for gene selection. The proposed method refines the feature (gene) space from a very coarse level to a fine-grained one at each recursive step of the algorithm without degrading the accuracy. In addition, we integrate various filter based ranking methods with the proposed recursive PSO approach. We also propose to use the linear support vector machine weight vector to serve as the initial gene pool selection. We evaluate our method on five publicly available benchmark microarray datasets. Our approach selects only a small number of genes while yielding substantial improvements in accuracy over state-of-the-art evolutionary methods.

Introduction

DNA microarray datasets have been used to analyze cellular functions of genes and to diagnose cancer [1]. Generally, disease diagnosis is carried out through classification. These datasets comprise a small number of samples, each recording the expression levels of thousands of genes. This leads to poor generalization in the classification process.

It is well known that only a small number of genes is sufficient to accurately diagnose some diseases, while a large set adversely affects the diagnosis process [1], [2]. Thus, there is a need to find a small subset of genes that is sufficient for disease diagnosis. Generally, the analysis of these datasets is carried out through classification/regression, where genes correspond to features and samples correspond to data points. As the presence of a large number of features (genes) in a dataset leads to poor generalization accuracy and high execution time, several methods have been proposed in the literature to identify a small subset of biologically relevant genes for classification in microarray data [1], [2], [3]. The search for such a relevant subset is computationally intractable, as the search space grows exponentially with the number of genes. The key idea behind gene selection is to remove irrelevant, noisy and redundant features from the training data to improve prediction accuracy.

The existing gene selection techniques can be divided into three categories: filter, embedded and wrapper. Filter based methods depend on a statistical estimate of the importance of genes (or subsets of genes) and are oblivious to the classifier being used [4], [5], [6]. These methods capture only the intrinsic characteristics of the genes and are unable to capture the complex interactions among them. Examples of filter based methods are F-score [7], Wilcoxon's rank test [3], Maximal Relevance (MaxRel), ReliefF, minimal-Redundancy-Maximal-Relevance (mRMR) [6] and Joint Mutual Information Maximization (JMIM) [8]. Embedded methods incorporate the feature selection metric directly into the objective function of the classifier, so that gene selection is part of the model construction process [9], [10], [11]. In these methods, the structure of the classification function plays a critical role. The feature selection and classification parts are integrated as a single unit, which limits their use with different classifiers.
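For concreteness, the F-score mentioned above can be computed in a few lines. The sketch below follows the common two-class definition (between-class mean separation over within-class spread); the exact variant used in [7] may differ, and the toy data are purely illustrative.

```python
import numpy as np

def f_score(X, y):
    """Per-feature two-class F-score: between-class mean separation
    divided by within-class variance. Labels y must be 0/1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    num = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / np.maximum(den, 1e-12)  # guard against zero variance

# toy example: gene 0 separates the classes, gene 1 is noise
X = np.array([[5.0, 0.1], [5.2, -0.2], [0.9, 0.0], [1.1, 0.1]])
y = np.array([1, 1, 0, 0])
scores = f_score(X, y)
```

Genes are then ranked by descending score, and only the top-scoring ones are retained, which is exactly why such filters are cheap: each gene is scored independently of the others.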

Wrapper based methods use classification accuracy to measure the quality of a feature subset without knowledge of the structure of the classification function. These methods heuristically search for the relevant subset of genes in an exponential search space. Examples of wrapper based methods are Particle Swarm Optimization with Wilcoxon's rank test [3], Genetic Swarm Algorithm (GSA) [1], a GA based classifier [12], PSO with GA (PSO-GA) [3], hybrid particle swarm optimization and tabu search (HPSOTS) [13], Binary Matrix Shuffling Filter with SVM (BMSF-SVM) [14], a novel hybrid framework (NHF) [15], kernel Fisher discriminant analysis (KFDA) [16], Binary Coded Genetic Algorithm (BCGA) [1], Real Coded Genetic Algorithm (RCGA) [1], and an enhancement of binary PSO (CPSO) [17].

Soft computing based approaches such as Backward Feature Elimination, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization and Simulated Annealing [1], [18], [19] fall into the category of wrapper based methods.

In these methods, a set of candidate solutions is generated based on some local heuristics as an initial step. In succeeding iterations, these solutions are refined based on the fitness of the candidates (particles). In the initial iterations, the particles explore a large search space; later, they exploit the search space to refine their solutions. These methods may suffer from high computational cost, as the classifier must be retrained a number of times for each gene subset. They exploit the intercorrelations among the genes for a particular classifier. Despite their high computational complexity, wrapper based methods are more popular in practice as they tend to achieve higher accuracies. Among the various methods, PSO based approaches have been widely used for gene selection [1], [20], [13], [12], [21], [22], [23], [24].

The authors in [1] proposed a Genetic Swarm Algorithm (GSA) combining the strengths of the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), where the GA chooses a near-optimal rule set for prediction and PSO tunes the membership function parameters. In another work [13], the authors proposed a tabu search based local improvement procedure with PSO to improve performance by escaping local optima. In [17], an enhanced binary particle swarm optimization (EBPSO) is proposed, which introduces a constraint on the particle's velocity vector and defines rules for updating the particle's position vector to improve accuracy. In [22], the authors enhanced PSO by embedding a correlation based local search to select a salient feature subset of reduced size. Further, Xi et al. [24] proposed a binary quantum-behaved particle swarm optimization (BQPSO) method to select a minimal subset of genes to improve classification.

There have been attempts to integrate filter based approaches such as Independent Component Analysis (ICA), clustering, Information Gain and Wilcoxon's rank test with wrapper based methods to improve generalization performance over many state-of-the-art PSO variants for feature selection [3], [15], [23].

In [3], the authors integrated the Wilcoxon rank sum test with PSO and GA to select a good subset of genes. The Wilcoxon rank sum test is used to filter the relevant set of genes, and a hybrid of PSO with GA is then applied on this filtered subset to compute a good set of genes. In another work [23], a combination of independent component analysis (ICA) and fuzzy backward feature elimination is used to improve accuracy with a minimal subset of genes. Further, an integration of clustering with a Quantum Genetic Algorithm (CQGA) is proposed in [25]. In CQGA, clustering is used to select a small set of non-redundant representative genes, followed by a Quantum Genetic Algorithm that determines a minimal set of relevant and non-redundant genes.

In literature, recursive gene (feature) selection methods have been proposed such as Support Vector Machine with Recursive Feature Elimination (SVM-RFE) and Ridge Regression with Recursive Feature Elimination (RR-RFE) [2], [26], [27].

The SVM-RFE method [2] employs a recursive procedure to select relevant genes based on the absolute weight vector generated at each recursive step. In [26], Li and Yang presented a study exploring various classifiers with a recursive feature elimination scheme for gene selection in microarray data. In a recursive feature elimination scheme, the search space is reduced by removing low-ranked genes at each recursive step; the rank of a gene is computed from the absolute weight vector of the underlying classifier. In their study, they found that the different penalization of redundant features affects the recursive feature elimination process. Further, Li and Yang showed the superiority of Ridge Regression with Recursive Feature Elimination (RR-RFE) over SVM-RFE, as the Ridge Regression classifier penalizes redundant features more heavily than the SVM classifier. A summary table with publication details, datasets, method types, results and remarks is presented in Table 1 of the supplementary file.
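The recursive elimination idea can be sketched in a few lines. The snippet below is a minimal illustration of an RR-RFE-style loop, not the implementation from [26]: the regularization strength `lam`, the drop fraction, and the toy data are all illustrative choices.

```python
import numpy as np

def ridge_rfe(X, y, n_keep, lam=1.0, drop_frac=0.5):
    """Recursively rank genes by the absolute ridge-regression weight
    and drop the lowest-ranked fraction until n_keep genes remain."""
    idx = np.arange(X.shape[1])
    while idx.size > n_keep:
        Xs = X[:, idx]
        # closed-form ridge weights: (Xs'Xs + lam*I)^-1 Xs'y
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(idx.size), Xs.T @ y)
        order = np.argsort(np.abs(w))                      # lowest |w| first
        n_drop = max(1, min(int(idx.size * drop_frac), idx.size - n_keep))
        idx = idx[np.sort(order[n_drop:])]                 # keep higher-ranked genes
    return idx

# toy data: only genes 3 and 7 determine the (noiseless) response
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 20))
y = X[:, 3] - X[:, 7]
kept = ridge_rfe(X, y, n_keep=2)
```

SVM-RFE follows the same loop with the SVM weight vector in place of the ridge weights; the difference Li and Yang observed comes from how strongly each classifier shrinks the weights of redundant features.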

In general, it is observed from these papers that the overall accuracy improves slightly as the number of selected features is reduced, though at the risk of losing the optimal solution. In this paper, our attempt is to present techniques that select a small subset of features without degrading the accuracy. We start with a very simple scheme where we run a linear SVM in the primal form to filter out a top set of genes based on the values of the weight vector. This weight vector is used for gene ranking and is integrated with the PSO method, subsequently termed the PSW method in this paper.
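The weight-based initial filtering can be illustrated as follows. This is a hedged sketch rather than the paper's implementation: the primal linear SVM is fitted here by plain subgradient descent on the regularized hinge loss, and all hyperparameters (`lam`, `lr`, `epochs`) and the toy data are illustrative.

```python
import numpy as np

def linear_svm_weights(X, y, lam=0.01, lr=0.1, epochs=200):
    """Primal linear SVM fitted by subgradient descent on the
    regularized hinge loss; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        active = y * (X @ w) < 1                     # margin violators
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w

def top_genes(X, y, k):
    """Rank genes by |w_j| and return the indices of the top k."""
    w = linear_svm_weights(X, y)
    return np.argsort(-np.abs(w))[:k]

# toy data: the class label is driven entirely by gene 2
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = np.where(X[:, 2] > 0, 1, -1)
pool = top_genes(X, y, k=3)
```

The returned pool then serves as the reduced gene set handed to the PSO search, which is the essence of the PSW scheme.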

Next, we present a recursive PSO scheme which gradually refines the feature (gene) space from a very coarse level to a fine-grained one by reducing the gene set at each step of the algorithm. The fitness of the PSO particles is computed using SVM classifier accuracy. This approach differs from the RR-RFE [26] and SVM-RFE approaches, where a ranking criterion is used to remove irrelevant and redundant genes at each recursive step. We have shown preliminary work using an iterative approach in [28].
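Schematically, the recursion could be organized as below. The inner PSO is abstracted away as a callable `run_pso` that returns a Boolean mask over the current gene pool; the stand-in `toy_pso` (a simple correlation filter) and the stopping rule are assumptions for illustration, not the paper's SVM-accuracy fitness function.

```python
import numpy as np

def recursive_select(X, y, run_pso, min_genes=2):
    """Run a wrapper search on the current gene pool, shrink the pool to
    the genes chosen by the best particle, and recurse until the pool
    stops shrinking or would fall below min_genes."""
    idx = np.arange(X.shape[1])
    while True:
        mask = run_pso(X, y, idx)          # Boolean mask over idx
        new_idx = idx[mask]
        if new_idx.size >= idx.size or new_idx.size < min_genes:
            return idx
        idx = new_idx                      # recurse on the reduced pool

def toy_pso(X, y, idx):
    # stand-in for BPSO: keep the better-correlated half of the pool
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in idx])
    return corr >= np.median(corr)

# toy data: gene 0 carries the signal
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 16))
y = X[:, 0] + 0.1 * rng.standard_normal(30)
kept = recursive_select(X, y, toy_pso)
```

Each recursive call thus searches a strictly smaller space, which is where the exploration-to-exploitation shift of the proposed method comes from.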

Further, we also examine the integration of F-score, Mutual Information and Wilcoxon's rank test with the proposed recursive PSO approach for gene selection. Here, the idea is to filter the most relevant genes prior to applying PSO. This strategy may improve accuracy, as the most irrelevant and redundant genes are filtered out, and the narrowed search space helps reduce the execution time of the PSO method. It is a two-step approach: in the first step, we select the top K genes using these rankings; in the second step, we apply the recursive PSO approach on this reduced set of genes. We compare the 10-fold (10CV) and leave-one-out (LOOCV) accuracies of our proposed methods with other well-known approaches on five publicly available benchmark microarray datasets.
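The two-step pipeline can be sketched as follows, with absolute Pearson correlation standing in for the F-score/Mutual Information/Wilcoxon rankings and a placeholder wrapper standing in for the recursive PSO; both stand-ins and the value of K are illustrative assumptions.

```python
import numpy as np

def two_step_select(X, y, k, wrapper):
    """Step 1: keep the top-k genes by a cheap filter score (here,
    absolute Pearson correlation with the label). Step 2: let the
    wrapper refine the search inside that reduced pool."""
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    pool = np.argsort(-score)[:k]
    return pool[wrapper(X[:, pool], y)]

# toy data: gene 10 drives the label; the other 49 genes are noise
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 50))
y = (X[:, 10] > 0).astype(float)

# placeholder wrapper (stands in for the recursive PSO step):
# keep the first half of the filtered pool
dummy_wrapper = lambda Xp, yp: np.arange(Xp.shape[1]) < 5
kept = two_step_select(X, y, k=10, wrapper=dummy_wrapper)
```

Because the wrapper only ever sees the K filtered columns, its per-iteration cost no longer grows with the full gene count, which is the execution-time argument made above.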

We show that our proposed recursive integration scheme achieves minimal subsets of genes while improving overall prediction accuracy in comparison to many state-of-the-art gene selection methods. The rest of the paper is organized as follows: we start by describing the binary PSO approach and our proposed recursive PSO method with its variants. This is followed by the experimental results. Finally, we conclude our work.

Section snippets

Binary PSO

PSO was originally developed for continuous-valued search spaces and was later extended to discrete-valued search spaces [29]. In Binary PSO (BPSO), the candidate solutions are represented by particles, each of which is a Boolean vector. The velocity of a particle is used to compute the probability of its next state. In many studies, gene selection is carried out based on the principles of binary PSO [3], [13], [15]. When BPSO is used for feature selection, a particle's position
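A single BPSO update, with the sigmoid mapping from velocity to bit probability, can be sketched as follows. The inertia and acceleration coefficients below are conventional textbook choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def bpso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One BPSO update: pull the velocity toward the personal and global
    bests, clamp it, squash it through a sigmoid, and resample each bit."""
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    vel = np.clip(vel, -vmax, vmax)            # velocity clamping
    prob = 1.0 / (1.0 + np.exp(-vel))          # P(bit = 1)
    pos = (rng.random(pos.shape) < prob).astype(int)
    return pos, vel

# toy swarm: 6 particles over 10 candidate genes
pos = rng.integers(0, 2, size=(6, 10))
vel = np.zeros((6, 10))
pbest, gbest = pos.copy(), pos[0]
pos, vel = bpso_step(pos, vel, pbest, gbest)
```

In the feature selection setting, a 1-bit in a particle's position marks the corresponding gene as selected, and the fitness of each particle is the classifier accuracy on that gene subset.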

Integrated approach for gene selection

The presence of a large number of genes in a dataset leads to poor generalization accuracy, as most of the genes are irrelevant and redundant. Filter based approaches try to remove redundant and highly correlated genes (features) at low computational cost [2], [6], but, unlike wrapper based approaches, they do not account for the intercorrelations among the genes. Wrapper based approaches perform this task at very high computational cost [1], [3]. In order to improve the

Proposed recursive PSO

Wrapper based approaches perform gene selection at a very high computational cost while producing improved accuracy [1], [3]. These methods explore a very large search space at each iteration of the algorithm. There should be a good compromise between exploration and exploitation when searching for an optimal solution. In order to search for an optimal solution, we propose to explore heavily during the standard wrapper iterations and exploit heavily during the recursive step by reducing the

Dataset

The proposed scheme is evaluated on five publicly available benchmark microarray data, namely, Colon, Lymphoma, Leukemia, Rheumatoid Arthritis versus Osteoarthritis (RAOA) and Type 2 Diabetes (T2D) datasets [1].

The description of these datasets, with the number of training instances, number of features and number of classes, is presented in Table 1. Table 2 lists the notations for the different variants of methods used in the experiments. It should be noted that F-score, Wilcoxon and

Conclusion

In this paper, we present a recursive formulation of a PSO based wrapper approach for gene selection. By integrating it with various filter based ranking strategies, we show that a considerable improvement in classification accuracy can be obtained, together with a considerable reduction in the selected gene set. We have compared our proposed approach with other existing results on five publicly available benchmark microarray datasets. Specifically, the

References (48)

  • P. Moradi et al.

    Integration of graph clustering with ant colony optimization for feature selection

    Knowl.-Based Syst.

    (2015)
  • P.K. Ganesh et al.

    Design of fuzzy expert system for microarray data classification using a novel Genetic Swarm Algorithm

    Expert Syst. Appl.

    (2012)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machines

    Mach. Learn.

    (2002)
  • S. Li et al.

    Gene selection using hybrid particle swarm optimization and genetic algorithm

    Soft Comput.

    (2008)
  • R. Bekkerman et al.

    Distributional word clusters vs. words for text categorization

    J. Mach. Learn. Res.

    (2003)
  • F. George et al.

    An extensive empirical study of feature selection metrics for text classification

    J. Mach. Learn. Res.

    (2003)
  • H. Peng et al.

    Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • L. Yuanning et al.

    An improved particle swarm optimization for feature selection

    J. Bionic Eng.

    (2011)
  • B. Mohamed et al.

    Feature selection using joint mutual information maximisation

    Expert Syst. Appl.

    (2015)
  • J. Weston et al.

    Feature selection for SVMs

    Advances in Neural Information Processing Systems (NIPS 13), vol. 13

    (2001)
  • M. Varma et al.

    More generality in efficient multiple kernel learning

  • M. Tan et al.

    Learning sparse SVM for feature selection on very high dimensional datasets

    Proceedings of the Twenty-Seventh International Conference on Machine Learning

    (2010)
  • S. Hengpraprohm et al.

    A GA-based classifier for microarray data classification

    2010 International Conference on Intelligent Computing and Cognitive Informatics (ICICCI)

    (2010)
  • H. Zhang et al.

    Improving accuracy for cancer classification with a new algorithm for genes selection

    BMC Bioinform.

    (2012)
    1

    Postdoctoral Fellow in the Department of Mathematics & Statistics at Thompson Rivers University, BC, Canada. Main areas of interest are Machine Learning, Soft Computing, Natural Language Processing and Convex Optimization.

    2

    Professor in the Department of Computer Science Engineering at Bennett University, Greater Noida, India. Main areas of interest are AI, Machine Learning, Soft Computing and Computer Vision.

    3

    Professor in the Department of Electrical Engineering at Indian Institute of Technology Delhi. Main areas of interest are Soft Computing, Image Processing, Computer Vision, Pattern Recognition, Biometrics, Surveillance and Intelligent Control.
