Applied Soft Computing

Volume 71, October 2018, Pages 213-225

A recursive PSO scheme for gene selection in microarray data

https://doi.org/10.1016/j.asoc.2018.06.019

Highlights

  • Proposed a recursive PSO method for gene selection.

  • Improved accuracy with a reduced set of features (genes).

  • Faster than the standard PSO and its variants.

Abstract

In DNA microarray datasets, the number of genes is very large, typically in the thousands, while the number of samples is in the hundreds. This raises the issue of generalization in the classification process. Gene selection plays a significant role in improving the accuracy. In this paper, we propose a recursive particle swarm optimization (PSO) approach for gene selection. The proposed method refines the feature (gene) space from a very coarse level to a fine-grained one at each recursive step of the algorithm without degrading the accuracy. In addition, we integrate various filter based ranking methods with the proposed recursive PSO approach. We also propose to use the linear support vector machine weight vector to serve as the initial gene pool selection. We evaluate our method on five publicly available benchmark microarray datasets. Our approach selects only a small number of genes while yielding substantial improvements in accuracy over state-of-the-art evolutionary methods.

Introduction

DNA microarray datasets have been used to analyze cellular functions of genes and to diagnose cancer [1]. Generally, disease diagnosis is carried out through classification. These datasets comprise a small number of samples, each recording the expression levels of thousands of genes. This leads to poor generalization in the classification process.

It is well known that only a small number of genes is sufficient to accurately diagnose some diseases, while a large set adversely affects the diagnosis process [1], [2]. Thus, there is a need to find a small subset of genes that is sufficient for disease diagnosis. Generally, the analysis of these datasets is carried out through classification/regression, where genes correspond to features and samples correspond to data points. As the presence of a large number of features (genes) in a dataset leads to poor generalization accuracy and high execution time, several methods have been proposed in the literature to identify a small subset of biologically relevant genes for classification in microarray data [1], [2], [3]. The search for such a relevant subset is computationally intractable, as the search space grows exponentially with the number of genes. The key idea behind gene selection is to remove irrelevant, noisy and redundant features from the training data to improve prediction accuracy.

The existing gene selection techniques can be divided into three categories: filter, embedded and wrapper. Filter based methods depend on a statistical estimate of the importance of genes (or subsets of genes) and are oblivious to the classifier being used [4], [5], [6]. These methods capture only the intrinsic characteristics of the genes and are unable to capture the complex interactions among them. Examples of filter based methods are F-score [7], Wilcoxon's rank test [3], Maximal Relevance (MaxRel), ReliefF, minimal-Redundancy-Maximal-Relevance (mRMR) [6] and Joint Mutual Information Maximization (JMIM) [8]. Embedded methods incorporate the feature selection metric directly into the objective function of the classifier, so that gene selection is part of the model construction process [9], [10], [11]. In these methods, the structure of the classification function plays a critical role. The feature selection and classification parts are integrated as a single unit, which limits their use with different classifiers.
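For concreteness, the F-score mentioned above can be computed in a few lines. The sketch below follows the common two-class definition (between-class mean separation over within-class spread); the exact variant used in [7] may differ, and the toy data are purely illustrative.

```python
import numpy as np

def f_score(X, y):
    """Per-feature two-class F-score: between-class mean separation
    divided by within-class variance. Labels y must be 0/1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    num = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / np.maximum(den, 1e-12)  # guard against zero variance

# toy example: gene 0 separates the classes, gene 1 is noise
X = np.array([[5.0, 0.1], [5.2, -0.2], [0.9, 0.0], [1.1, 0.1]])
y = np.array([1, 1, 0, 0])
scores = f_score(X, y)
```

Genes are then ranked by descending score, and only the top-scoring ones are retained, which is exactly why such filters are cheap: each gene is scored independently of the others.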

Wrapper based methods use classification accuracy to measure the quality of a feature subset without knowledge of the structure of the classification function. These methods heuristically search for the relevant subset of genes in an exponential search space. Examples of wrapper based methods are Particle Swarm Optimization with Wilcoxon's rank test [3], Genetic Swarm Algorithm (GSA) [1], a GA based classifier [12], PSO with GA (PSO-GA) [3], hybrid particle swarm optimization and tabu search (HPSOTS) [13], Binary Matrix Shuffling Filter with SVM (BMSF-SVM) [14], a novel hybrid framework (NHF) [15], kernel Fisher discriminant analysis (KFDA) [16], Binary Coded Genetic Algorithm (BCGA) [1], Real Coded Genetic Algorithm (RCGA) [1], and an enhancement of binary PSO (CPSO) [17].

Soft computing based approaches such as Backward Feature Elimination, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization and Simulated Annealing [1], [18], [19] fall into the category of wrapper based methods.

In these methods, a set of candidate solutions is generated based on some local heuristics as an initial step. In succeeding iterations, these solutions are refined based on the fitness of the candidates (particles). In the initial iterations, the particles explore a large search space; later, they exploit the search space to refine their solutions. These methods may suffer from high computational cost, as the classifier must be retrained a number of times for each gene subset. They exploit the intercorrelations among the genes for a particular classifier. Despite their high computational complexity, wrapper based methods are more popular in practice as they tend to achieve higher accuracies. Among the various methods, PSO based approaches have been widely used for gene selection [1], [20], [13], [12], [21], [22], [23], [24].

The authors in [1] proposed a Genetic Swarm Algorithm (GSA) combining the strengths of the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), where the GA chooses a near-optimal rule set for prediction and PSO tunes the membership function parameters. In another work [13], the authors proposed a tabu search based local improvement procedure with PSO to improve performance by escaping local optima. In [17], an enhanced binary particle swarm optimization (EBPSO) is proposed, which introduces a constraint on the particle's velocity vector and defines rules for updating the particle's position vector to improve accuracy. In [22], the authors enhanced PSO by embedding a correlation based local search to select a salient feature subset of reduced size. Further, Xi et al. [24] proposed a binary quantum-behaved particle swarm optimization (BQPSO) method to select a minimal subset of genes to improve classification.

There have been attempts to integrate filter based approaches such as Independent Component Analysis (ICA), clustering, Information Gain and Wilcoxon's rank test with wrapper based methods to improve generalization performance over many state-of-the-art PSO variants for feature selection [3], [15], [23].

In [3], the authors integrated the Wilcoxon rank sum test with PSO and GA to select a good subset of genes. The Wilcoxon rank sum test is used to filter the relevant set of genes, and a hybrid of PSO with GA is then applied on this filtered subset to compute a good set of genes. In another work [23], a combination of independent component analysis (ICA) and fuzzy backward feature elimination is used to improve accuracy with a minimal subset of genes. Further, an integration of clustering with a Quantum Genetic Algorithm (CQGA) is proposed in [25]. In CQGA, clustering is used to select a small set of non-redundant representative genes, followed by a Quantum Genetic Algorithm that determines a minimal set of relevant and non-redundant genes.

In literature, recursive gene (feature) selection methods have been proposed such as Support Vector Machine with Recursive Feature Elimination (SVM-RFE) and Ridge Regression with Recursive Feature Elimination (RR-RFE) [2], [26], [27].

The SVM-RFE method [2] employs a recursive procedure to select relevant genes based on the absolute weight vector generated at each recursive step. In [26], Li and Yang presented a study exploring various classifiers with a recursive feature elimination scheme for gene selection in microarray data. In a recursive feature elimination scheme, the search space is reduced by removing low-ranked genes at each recursive step; the rank of a gene is computed from the absolute weight vector of the underlying classifier. In their study, they found that the different penalization of redundant features affects the recursive feature elimination process. Further, Li and Yang showed the superiority of Ridge Regression with Recursive Feature Elimination (RR-RFE) over SVM-RFE, as the Ridge Regression classifier penalizes redundant features more heavily than the SVM classifier. A summary table with publication details, datasets, method types, results and remarks is presented in Table 1 of the supplementary file.
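The recursive elimination idea can be sketched in a few lines. The snippet below is a minimal illustration of an RR-RFE-style loop, not the implementation from [26]: the regularization strength `lam`, the drop fraction, and the toy data are all illustrative choices.

```python
import numpy as np

def ridge_rfe(X, y, n_keep, lam=1.0, drop_frac=0.5):
    """Recursively rank genes by the absolute ridge-regression weight
    and drop the lowest-ranked fraction until n_keep genes remain."""
    idx = np.arange(X.shape[1])
    while idx.size > n_keep:
        Xs = X[:, idx]
        # closed-form ridge weights: (Xs'Xs + lam*I)^-1 Xs'y
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(idx.size), Xs.T @ y)
        order = np.argsort(np.abs(w))                      # lowest |w| first
        n_drop = max(1, min(int(idx.size * drop_frac), idx.size - n_keep))
        idx = idx[np.sort(order[n_drop:])]                 # keep higher-ranked genes
    return idx

# toy data: only genes 3 and 7 determine the (noiseless) response
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 20))
y = X[:, 3] - X[:, 7]
kept = ridge_rfe(X, y, n_keep=2)
```

SVM-RFE follows the same loop with the SVM weight vector in place of the ridge weights; the difference Li and Yang observed comes from how strongly each classifier shrinks the weights of redundant features.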

In general, it is observed from these papers that the overall accuracy improves slightly as the number of selected features is reduced, though at the risk of losing the optimal solution. In this paper, our attempt is to present techniques that select a small subset of features without degrading the accuracy. We start with a very simple scheme where we run a linear SVM in the primal form to filter out a top set of genes based on the values of the weight vector. This weight vector is used for gene ranking and is integrated with the PSO method, subsequently termed the PSW method in this paper.
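The weight-based initial filtering can be illustrated as follows. This is a hedged sketch rather than the paper's implementation: the primal linear SVM is fitted here by plain subgradient descent on the regularized hinge loss, and all hyperparameters (`lam`, `lr`, `epochs`) and the toy data are illustrative.

```python
import numpy as np

def linear_svm_weights(X, y, lam=0.01, lr=0.1, epochs=200):
    """Primal linear SVM fitted by subgradient descent on the
    regularized hinge loss; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        active = y * (X @ w) < 1                     # margin violators
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w

def top_genes(X, y, k):
    """Rank genes by |w_j| and return the indices of the top k."""
    w = linear_svm_weights(X, y)
    return np.argsort(-np.abs(w))[:k]

# toy data: the class label is driven entirely by gene 2
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = np.where(X[:, 2] > 0, 1, -1)
pool = top_genes(X, y, k=3)
```

The returned pool then serves as the reduced gene set handed to the PSO search, which is the essence of the PSW scheme.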

Next, we present a recursive PSO scheme which gradually refines the feature (gene) space from a very coarse level to a fine-grained one by reducing the gene set at each step of the algorithm. The fitness of the PSO particles is computed using SVM classifier accuracy. This approach differs from the RR-RFE [26] and SVM-RFE approaches, where a ranking criterion is used to remove irrelevant and redundant genes at each recursive step. We have shown preliminary work using an iterative approach in [28].
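Schematically, the recursion could be organized as below. The inner PSO is abstracted away as a callable `run_pso` that returns a Boolean mask over the current gene pool; the stand-in `toy_pso` (a simple correlation filter) and the stopping rule are assumptions for illustration, not the paper's SVM-accuracy fitness function.

```python
import numpy as np

def recursive_select(X, y, run_pso, min_genes=2):
    """Run a wrapper search on the current gene pool, shrink the pool to
    the genes chosen by the best particle, and recurse until the pool
    stops shrinking or would fall below min_genes."""
    idx = np.arange(X.shape[1])
    while True:
        mask = run_pso(X, y, idx)          # Boolean mask over idx
        new_idx = idx[mask]
        if new_idx.size >= idx.size or new_idx.size < min_genes:
            return idx
        idx = new_idx                      # recurse on the reduced pool

def toy_pso(X, y, idx):
    # stand-in for BPSO: keep the better-correlated half of the pool
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in idx])
    return corr >= np.median(corr)

# toy data: gene 0 carries the signal
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 16))
y = X[:, 0] + 0.1 * rng.standard_normal(30)
kept = recursive_select(X, y, toy_pso)
```

Each recursive call thus searches a strictly smaller space, which is where the exploration-to-exploitation shift of the proposed method comes from.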

Further, we also examine the integration of F-score, Mutual Information and Wilcoxon's rank test with the proposed recursive PSO approach for gene selection. Here, the idea is to filter the most relevant genes prior to applying PSO. This strategy may improve accuracy, as the most irrelevant and redundant genes are filtered out, and the narrowed search space helps reduce the execution time of the PSO method. It is a two-step approach: in the first step, we select the top K genes using these rankings; in the second step, we apply the recursive PSO approach on this reduced set of genes. We compare the 10-fold (10CV) and leave-one-out (LOOCV) accuracies of our proposed methods with other well-known approaches on five publicly available benchmark microarray datasets.
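The two-step pipeline can be sketched as follows, with absolute Pearson correlation standing in for the F-score/Mutual Information/Wilcoxon rankings and a placeholder wrapper standing in for the recursive PSO; both stand-ins and the value of K are illustrative assumptions.

```python
import numpy as np

def two_step_select(X, y, k, wrapper):
    """Step 1: keep the top-k genes by a cheap filter score (here,
    absolute Pearson correlation with the label). Step 2: let the
    wrapper refine the search inside that reduced pool."""
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    pool = np.argsort(-score)[:k]
    return pool[wrapper(X[:, pool], y)]

# toy data: gene 10 drives the label; the other 49 genes are noise
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 50))
y = (X[:, 10] > 0).astype(float)

# placeholder wrapper (stands in for the recursive PSO step):
# keep the first half of the filtered pool
dummy_wrapper = lambda Xp, yp: np.arange(Xp.shape[1]) < 5
kept = two_step_select(X, y, k=10, wrapper=dummy_wrapper)
```

Because the wrapper only ever sees the K filtered columns, its per-iteration cost no longer grows with the full gene count, which is the execution-time argument made above.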

We show that our proposed recursive integration scheme achieves minimal subsets of genes while improving overall prediction accuracy in comparison to many state-of-the-art gene selection methods. The rest of the paper is organized as follows: we start by describing the binary PSO approach and our proposed recursive PSO method with its variants. This is followed by the experimental results. Finally, we conclude our work.

Section snippets

Binary PSO

PSO was originally developed for continuous-valued search spaces and was later extended to discrete-valued search spaces [29]. In Binary PSO (BPSO), the candidate solutions are represented by particles, each of which is a Boolean vector. The velocity of a particle is used to compute the probability of its next state. In many studies, gene selection is carried out based on the principles of binary PSO [3], [13], [15]. When BPSO is used for feature selection, a particle's position
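A single BPSO update, with the sigmoid mapping from velocity to bit probability, can be sketched as follows. The inertia and acceleration coefficients below are conventional textbook choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def bpso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One BPSO update: pull the velocity toward the personal and global
    bests, clamp it, squash it through a sigmoid, and resample each bit."""
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    vel = np.clip(vel, -vmax, vmax)            # velocity clamping
    prob = 1.0 / (1.0 + np.exp(-vel))          # P(bit = 1)
    pos = (rng.random(pos.shape) < prob).astype(int)
    return pos, vel

# toy swarm: 6 particles over 10 candidate genes
pos = rng.integers(0, 2, size=(6, 10))
vel = np.zeros((6, 10))
pbest, gbest = pos.copy(), pos[0]
pos, vel = bpso_step(pos, vel, pbest, gbest)
```

In the feature selection setting, a 1-bit in a particle's position marks the corresponding gene as selected, and the fitness of each particle is the classifier accuracy on that gene subset.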

Integrated approach for gene selection

The presence of a large number of genes in a dataset leads to poor generalization accuracy, as most of the genes are irrelevant and redundant. Filter based approaches try to remove redundant and highly correlated genes (features) at low computational cost [2], [6], but, unlike wrapper based approaches, they do not account for the intercorrelations among the genes. Wrapper based approaches perform this task at very high computational cost [1], [3]. In order to improve the

Proposed recursive PSO

Wrapper based approaches perform gene selection at a very high computational cost while producing improved accuracy [1], [3]. These methods explore a very large search space at each iteration of the algorithm. There should be a good compromise between exploration and exploitation when searching for an optimal solution. In order to search for an optimal solution, we propose to explore heavily during the standard wrapper iterations and exploit heavily during the recursive step by reducing the

Dataset

The proposed scheme is evaluated on five publicly available benchmark microarray data, namely, Colon, Lymphoma, Leukemia, Rheumatoid Arthritis versus Osteoarthritis (RAOA) and Type 2 Diabetes (T2D) datasets [1].

The description of these datasets, with the number of training instances, number of features and number of classes, is presented in Table 1. Table 2 lists the notations for the different variants of methods used in the experiments. It should be noted that F-score, Wilcoxon and

Conclusion

In this paper, we present a recursive formulation of a PSO based wrapper approach for gene selection. By integrating it with various filter based ranking strategies, we show that a considerable improvement in classification accuracy can be obtained, together with a considerable reduction in the selected gene set. We have compared our proposed approach with other existing results on five publicly available benchmark microarray datasets. Specifically, the

References (48)

  • P. Moradi et al.

    Integration of graph clustering with ant colony optimization for feature selection

    Knowl.-Based Syst.

    (2015)
  • P.K. Ganesh et al.

    Design of fuzzy expert system for microarray data classification using a novel Genetic Swarm Algorithm

    Expert Syst. Appl.

    (2012)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machines

    Mach. Learn.

    (2002)
  • S. Li et al.

    Gene selection using hybrid particle swarm optimization and genetic algorithm

    Soft Comput.

    (2008)
  • R. Bekkerman et al.

    Distributional word clusters vs. words for text categorization

    J. Mach. Learn. Res.

    (2003)
  • F. George et al.

    An extensive empirical study of feature selection metrics for text classification

    J. Mach. Learn. Res.

    (2003)
  • H. Peng et al.

    Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • L. Yuanning et al.

    An improved particle swarm optimization for feature selection

    J. Bionic Eng.

    (2011)
  • B. Mohamed et al.

    Feature selection using joint mutual information maximisation

    Expert Syst. Appl.

    (2015)
  • J. Weston et al.

    Feature selection for SVMs

    Advances in Neural Information Processing Systems (NIPS 13), vol. 13

    (2001)
  • M. Varma et al.

    More generality in efficient multiple kernel learning

  • M. Tan et al.

    Learning sparse SVM for feature selection on very high dimensional datasets

    Proceedings of the Twenty-Seventh International Conference on Machine Learning

    (2010)
  • S. Hengpraprohm et al.

    A GA-based classifier for microarray data classification

    2010 International Conference on Intelligent Computing and Cognitive Informatics (ICICCI)

    (2010)
  • H. Zhang et al.

    Improving accuracy for cancer classification with a new algorithm for genes selection

    BMC Bioinform.

    (2012)
    1

    Postdoctoral Fellow in the Department of Mathematics & Statistics at Thompson Rivers University, BC, Canada. Main areas of interest are Machine Learning, Soft Computing, Natural Language Processing and Convex Optimization.

    2

    Professor in the Department of Computer Science Engineering at Bennett University, Greater Noida, India. Main areas of interest are AI, Machine Learning, Soft Computing and Computer Vision.

    3

    Professor in the Department of Electrical Engineering at Indian Institute of Technology Delhi. Main areas of interest are Soft Computing, Image Processing, Computer Vision, Pattern Recognition, Biometrics, Surveillance and Intelligent Control.
