A recursive PSO scheme for gene selection in microarray data
Graphical abstract
Introduction
DNA microarray datasets have been used to analyze the cellular functions of genes and to diagnose cancer [1]. Disease diagnosis is generally carried out through classification. These datasets comprise a small number of samples, each recording the expression levels of thousands of genes, which leads to poor generalization in the classification process.
It is well known that only a small number of genes is sufficient to accurately diagnose some diseases, while a large set adversely affects the diagnosis process [1], [2]. Thus, there is a need to find a small subset of genes that is sufficient for disease diagnosis. Generally, the analysis of these datasets is carried out through classification/regression, where genes correspond to features and samples correspond to data points. As the presence of a large number of features (genes) in a dataset leads to poor generalization accuracy and high execution time, several methods have been proposed in the literature to identify a small subset of biologically relevant genes for classification in microarray data [1], [2], [3]. The search for such a relevant subset is computationally intractable, as the search space grows exponentially with the number of genes. The key idea behind gene selection is to remove irrelevant, noisy and redundant features from the training data to improve prediction accuracy.
The existing gene selection techniques can be divided into three categories, namely filter, embedded and wrapper. Filter based methods depend on a statistical estimate of the importance of genes (or subsets of genes) and are oblivious to the classifier being used [4], [5], [6]. These methods capture only the intrinsic characteristics of the genes and cannot capture the complex interactions amongst them. Examples of filter based methods are F-score [7], Wilcoxon's rank test [3], Maximal Relevance (MaxRel), ReliefF, minimal-Redundancy-Maximal-Relevance (mRMR) [6] and Joint Mutual Information Maximization (JMIM) [8]. Embedded methods incorporate the feature selection metric directly into the objective function of the classifier, so that gene selection is part of the model construction process [9], [10], [11]. In these methods, the structure of the classification function plays a critical role: the feature selection and classification parts are integrated as a single unit, which limits their use with different classifiers.
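To make the filter idea concrete, the F-score criterion for a single gene in a two-class problem can be sketched in plain Python as follows. This is our own minimal illustration of the usual two-class F-score formula (between-class scatter over pooled within-class variance); the function name and the toy expression values are not from the paper:

```python
def f_score(pos, neg):
    """F-score of one gene: scatter of the two class means around the
    overall mean, divided by the pooled within-class variance."""
    n_pos, n_neg = len(pos), len(neg)
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    return between / within

# A gene whose expression separates the two classes scores much higher
# than one whose values overlap across classes.
discriminative = f_score([5.1, 4.9, 5.3], [1.0, 1.2, 0.9])
noisy = f_score([2.0, 4.0, 3.1], [3.0, 2.1, 4.2])
```

Ranking all genes by this score and keeping the top K is the typical filter step; being classifier-agnostic, it is cheap but blind to gene interactions, as noted above.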
Wrapper based methods use classification accuracy to measure the quality of a feature subset, without knowledge of the structure of the classification function. These methods heuristically search for the relevant subset of genes in an exponential search space. Examples of wrapper based methods are Particle Swarm Optimization with Wilcoxon's Rank Test [3], Genetic Swarm Algorithm (GSA) [1], a GA based classifier [12], PSO with GA (PSO-GA) [3], hybrid particle swarm optimization and tabu search (HPSOTS) [13], Binary Matrix Shuffling Filter with SVM (BMSF-SVM) [14], a novel hybrid framework (NHF) [15], kernel Fisher discriminant analysis (KFDA) [16], Binary Coded Genetic Algorithm (BCGA) [1], Real Coded Genetic Algorithm (RCGA) [1], and an enhancement of Binary PSO (CPSO) [17].
The soft computing based approaches such as Backward Feature Elimination, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization and simulated annealing etc. [1], [18], [19] fall in the category of wrapper based methods.
In these methods, a set of candidate solutions is generated based on some local heuristics as an initial step. In the succeeding iterations, these solutions are refined based on the fitness of the candidates (particles). In the initial iterations, the particles explore a large search space; later, they exploit the search space to refine their solutions. These methods may suffer from high computational cost, as the classifier must be retrained a number of times, once for each candidate gene subset. They exploit the intercorrelations amongst the genes for a particular classifier. Despite their high computational complexity, wrapper based methods are more popular in practice, as they tend to achieve higher accuracies. Among the various methods, PSO based approaches have been widely used for gene selection [1], [20], [13], [12], [21], [22], [23], [24].
The authors in [1] proposed a Genetic Swarm Algorithm (GSA) by combining the strengths of the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), where the GA chooses a near optimal rule set for prediction and PSO tunes the membership function parameters. In another work [13], the authors proposed a tabu search based local improvement procedure with PSO to improve performance by escaping local optima. In [17], an enhanced binary particle swarm optimization (EBPSO) is proposed that introduces a constraint over the particle's velocity vector and defines rules for updating the particle's position vector to improve accuracy. In [22], the authors enhanced PSO by embedding a correlation based local search to select a salient feature subset of reduced size. Further, Xi et al. [24] proposed a binary quantum-behaved particle swarm optimization (BQPSO) method to select a minimal subset of genes and thereby improve classification.
There have been attempts to integrate filter based approaches such as Independent Component Analysis (ICA), clustering, Information Gain and Wilcoxon's rank test with wrapper based methods to improve the generalization performance over many state-of-the-art PSO variants for feature selection [3], [15], [23].
In [3], the authors integrated the Wilcoxon rank sum test with PSO and GA to select a good subset of genes: the Wilcoxon rank sum test filters a relevant set of genes, and a hybrid of PSO with GA is then applied to this filtered subset to compute a good set of genes. In another work [23], a combination of independent component analysis (ICA) and fuzzy backward feature elimination is used to improve accuracy with a minimal subset of genes. Further, an integration of clustering with a Quantum Genetic Algorithm (CQGA) is proposed in [25]. In CQGA, clustering is used to select a small set of non-redundant representative genes, followed by a Quantum Genetic Algorithm, which determines a minimal set of relevant and non-redundant genes.
In the literature, recursive gene (feature) selection methods have been proposed, such as Support Vector Machine with Recursive Feature Elimination (SVM-RFE) and Ridge Regression with Recursive Feature Elimination (RR-RFE) [2], [26], [27].
The SVM-RFE method [2] employs a recursive procedure to select relevant genes based on the absolute weight vector generated at each recursive step. In [26], Li and Yang presented a study exploring various classifiers combined with a recursive feature elimination scheme for gene selection in microarray data. In this scheme, the search space is reduced by removing low-rank genes at each recursive step, where the rank of a gene is computed from the absolute weight vector of the underlying classifier. In their study, they found that how strongly a classifier penalizes redundant features affects the recursive feature elimination process. Further, Li and Yang showed the superiority of the Ridge Regression with Recursive Feature Elimination (RR-RFE) approach over SVM-RFE, as the Ridge Regression classifier penalizes redundant features more than the SVM classifier. A summary table with publication details, datasets, method type, results and remarks is presented in Table 1 of the supplementary file.
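The elimination loop shared by SVM-RFE and RR-RFE can be sketched as the following generic skeleton. This is our own illustration, not the papers' code: `weight_of` stands in for the |w| entry of whichever linear classifier is retrained at each step, and the toy weights below are invented for the example:

```python
def rfe(genes, weight_of, target_size, drop_frac=0.5):
    """Generic recursive feature elimination: at each step, rank the
    surviving genes by the magnitude of a classifier-derived weight and
    discard the lowest-ranked fraction, until target_size genes remain."""
    genes = list(genes)
    while len(genes) > target_size:
        genes.sort(key=lambda g: abs(weight_of(g)), reverse=True)
        keep = max(target_size, int(len(genes) * (1 - drop_frac)))
        genes = genes[:keep]
    return genes

# Toy run: the dict values stand in for |w| entries of a linear classifier
# (in the real methods, the classifier is retrained on the surviving genes
# at every step, so the weights change between iterations).
weights = {"g1": 0.9, "g2": -0.8, "g3": 0.05, "g4": 0.4, "g5": -0.01}
selected = rfe(weights, weights.get, target_size=2)
```

Note that the ranking is purely weight-driven; the recursive PSO scheme proposed in this paper instead lets a search procedure decide which genes survive each round.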
In general, it is observed from these papers that the overall accuracy improves slightly with a reduction in the number of selected features, at the risk of losing the optimal solution. In this paper, our aim is to present techniques that select a small subset of features without degrading the accuracy. We start with a very simple scheme in which we run a linear SVM in the primal form to filter out a top set of genes based on the values of the weight vector. This weight vector is used for gene ranking and integrated with the PSO method, subsequently termed the PSW method in our paper.
Next, we present a recursive PSO scheme which gradually refines the feature (gene) space from a very coarse level to a fine-grained one by reducing the gene set at each step of the algorithm. The fitness of the PSO particles is computed using the SVM classifier accuracy. This approach differs from the RR-RFE [26] and SVM-RFE approaches, where a ranking criterion is used to remove irrelevant and redundant genes at each recursive step. We presented preliminary work using an iterative approach in [28].
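The coarse-to-fine refinement described above might be sketched as the following outer loop. This is our own illustrative skeleton, not the paper's implementation: `run_bpso` is a hypothetical stand-in for a full binary-PSO run (with SVM accuracy as fitness) that returns the best gene subset found on the current pool:

```python
def recursive_pso(genes, run_bpso, min_size):
    """Coarse-to-fine refinement: run a binary PSO on the current gene
    pool, keep only the genes that the best particle selected, and
    recurse on that reduced pool until no further shrinking is possible."""
    pool = list(genes)
    while len(pool) > min_size:
        selected = run_bpso(pool)        # best subset from one PSO run
        if len(selected) >= len(pool):   # no reduction achieved: stop
            break
        pool = selected
    return pool

# Toy stand-in for a PSO run: keep the even-indexed half of the pool.
toy_bpso = lambda pool: pool[::2]
result = recursive_pso(range(16), toy_bpso, min_size=3)
```

The contrast with RFE is that each shrinking step is driven by a population-based search over subsets rather than by a per-gene weight ranking.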
Further, we also examine the integration of F-score, Mutual Information and Wilcoxon's rank test with the proposed recursive PSO approach for gene selection. Here, the idea is to filter the most relevant genes prior to applying PSO. This strategy may improve accuracy, as most irrelevant and redundant genes are filtered out; the search space becomes narrower, which helps reduce the execution time of the PSO method. This is a two step approach: in the first step, we select the top K genes using these rankings, while in the second step, we apply the recursive PSO approach to this reduced set of genes. We have compared the 10-fold (10CV) and leave-one-out (LOOCV) accuracies of our proposed methods with other well-known approaches on five publicly available benchmark microarray datasets.
We show that our proposed recursive integration scheme achieves minimal subsets of genes while improving the overall prediction accuracy in comparison to many state-of-the-art gene selection methods. The rest of the paper is organized as follows: we start by describing the binary PSO approach and our proposed recursive PSO method with its variants. This is followed by the experimental results. Finally, we conclude our work.
Section snippets
Binary PSO
PSO was originally developed for continuous valued search spaces and was later extended to discrete valued search spaces [29]. In Binary PSO (BPSO), the candidate solutions are represented by particles, each of which is a boolean vector. The velocity of a particle is used to compute the probability of its next state. In many studies, gene selection is carried out based on the principles of binary PSO [3], [13], [15]. When BPSO is used for feature selection, a particle's position
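The velocity-to-probability rule described above can be sketched as follows. This is a minimal illustration of the standard sigmoid-based BPSO update (Kennedy and Eberhart); the inertia and acceleration constants are illustrative defaults, not the settings used in this paper:

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def update_particle(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One BPSO step: pull each velocity component toward the particle's
    own best (pbest) and the swarm best (gbest), then set each bit to 1
    with probability sigmoid(velocity)."""
    new_vel, new_pos = [], []
    for x, v, pb, gb in zip(pos, vel, pbest, gbest):
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        new_vel.append(v)
        new_pos.append(1 if rng.random() < sigmoid(v) else 0)
    return new_pos, new_vel
```

In gene selection, each bit marks whether the corresponding gene is included; a large positive velocity component drives the bit toward 1 (gene kept), a large negative one toward 0 (gene dropped).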
Integrated approach for gene selection
The presence of a large number of genes in a dataset leads to poor generalization accuracy, as most of the genes are irrelevant and redundant. Filter based approaches try to remove the redundant and highly correlated genes (features) at low computational cost [2], [6]; unlike wrapper based approaches, however, they do not take the intercorrelations amongst the genes into account. Wrapper based approaches perform this task at a very high computational cost [1], [3]. In order to improve the
Proposed recursive PSO
Wrapper based approaches perform gene selection at a very high computational cost while producing improved accuracy [1], [3]. These methods explore a very large search space at each iteration of the algorithm. There should be a good compromise between exploration and exploitation when searching for an optimal solution. In order to search for an optimal solution, we propose to explore heavily during the standard wrapper iterations and to exploit heavily during the recursive step by reducing the
Dataset
The proposed scheme is evaluated on five publicly available benchmark microarray datasets, namely Colon, Lymphoma, Leukemia, Rheumatoid Arthritis versus Osteoarthritis (RAOA) and Type 2 Diabetes (T2D) [1].
The description of these datasets, with the number of training instances, number of features and number of classes, is presented in Table 1. Table 2 lists the notation for the different variants of the methods used in the experiments. It should be noted that F-score, Wilcoxon and
Conclusion
In this paper, we present a recursive formulation of a PSO based wrapper approach for gene selection. By integrating it with various filter based ranking strategies, we show that a considerable improvement in classification accuracy can be obtained, together with a considerable reduction in the size of the selected gene set. We have compared our proposed approach with other existing results on five publicly available benchmark microarray datasets. Specifically, the
References (48)
- et al., Hybrid particle swarm optimization and Tabu search approach for selecting genes for tumor classification using gene expression data, Comput. Biol. Chem. (2008)
- et al., New gene selection method for classification of cancer subtypes considering within-class variation, FEBS Lett. (2003)
- et al., A novel ACO-GA hybrid algorithm for feature selection in protein function prediction, Expert Syst. Appl. (2009)
- et al., Text feature selection using ant colony optimization, Expert Syst. Appl. (2009)
- et al., Wrappers for feature subset selection, Artif. Intell. (1997)
- et al., A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Appl. Soft Comput. (2016)
- et al., A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data (2016)
- et al., Combination of feature selection approaches with SVM in credit scoring, Expert Syst. Appl. (2010)
- et al., Relevance-redundancy feature selection based on ant colony optimization, Pattern Recognit. (2015)
- et al., Gene selection for microarray data classification using a novel ant colony optimization, Neurocomputing (2015)
- Integration of graph clustering with ant colony optimization for feature selection, Knowl.-Based Syst.
- Design of fuzzy expert system for microarray data classification using a novel Genetic Swarm Algorithm, Expert Syst. Appl.
- Gene selection for cancer classification using support vector machines, Mach. Learn.
- Gene selection using hybrid particle swarm optimization and genetic algorithm, Soft Comput.
- Distributional word clusters vs. words for text categorization, J. Mach. Learn. Res.
- An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res.
- Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell.
- An improved particle swarm optimization for feature selection, J. Bionic Eng.
- Feature selection using joint mutual information maximisation, Expert Syst. Appl.
- Feature selection for SVMs, Advances in Neural Information Processing Systems (NIPS 13), vol. 13
- More generality in efficient multiple kernel learning
- Learning sparse SVM for feature selection on very high dimensional datasets, Proceedings of the Twenty-Seventh International Conference on Machine Learning
- A GA-based classifier for microarray data classification, 2010 International Conference on Intelligent Computing and Cognitive Informatics (ICICCI)
- Improving accuracy for cancer classification with a new algorithm for genes selection, BMC Bioinform.
Cited by (49)
- A recursive framework for improving the performance of multi-objective differential evolution algorithms for gene selection, Swarm and Evolutionary Computation (2024)
- Maximum margin and global criterion based-recursive feature selection, Neural Networks (2024)
- Fuzzy-based concept-cognitive learning: An investigation of novel approach to tumor diagnosis analysis, Information Sciences (2023)
- Tree enhanced deep adaptive network for cancer prediction with high dimension low sample size microarray data, Applied Soft Computing (2023)
- A two-phase gene selection method using anomaly detection and genetic algorithm for microarray data, Knowledge-Based Systems (2023)
- A self-adaptive quantum equilibrium optimizer with artificial bee colony for feature selection, Computers in Biology and Medicine (2023)
- 1
Postdoctoral Fellow in the Department of Mathematics & Statistics at Thompson Rivers University, BC, Canada. Main areas of interest are Machine Learning, Soft Computing, Natural Language Processing and Convex Optimization.
- 2
Professor in the Department of Computer Science Engineering at Bennett University, Greater Noida, India. Main areas of interest are AI, Machine Learning, Soft Computing and Computer Vision.
- 3
Professor in the Department of Electrical Engineering at the Indian Institute of Technology Delhi. Main areas of interest are Soft Computing, Image Processing, Computer Vision, Pattern Recognition, Biometrics, Surveillance and Intelligent Control.