A hybrid feature selection method for DNA microarray data

https://doi.org/10.1016/j.compbiomed.2011.02.004

Abstract

Gene expression profiles, which represent the state of a cell at the molecular level, have great potential as a medical diagnosis tool. In cancer classification, available training data sets are generally of a fairly small sample size compared to the number of genes involved. This imbalance, together with other limitations of the training data, poses a challenge to many classification methods. Feature (gene) selection can be used to extract those genes that directly influence classification accuracy and to eliminate genes that have no influence on it, significantly improving computational performance and classification accuracy. In this paper, correlation-based feature selection (CFS) and the Taguchi-genetic algorithm (TGA) were combined into a hybrid method, and the K-nearest neighbor (KNN) classifier with leave-one-out cross-validation (LOOCV) served to compute the classification accuracy on eleven classification profiles. Experimental results show that the proposed method reduced redundant features effectively and achieved superior classification accuracy: the accuracy obtained by the proposed method was higher in ten of the eleven gene expression data set test problems when compared to other classification methods from the literature.

Introduction

Microarray data can provide valuable results for a variety of gene expression profile problems and contribute to advances in clinical medicine. The application of microarray data to cancer type classification has recently gained popularity. Coupled with statistical techniques, gene expression patterns have been used to screen for potential tumor markers. The differential expression of genes is analyzed statistically, and each gene is assigned to a certain category. The classification of gene expressions can substantially enhance the understanding of the underlying biological processes.

The goal of microarray data classification is to build an efficient and effective model that can differentiate the gene expressions of samples, i.e., determine normal or abnormal states, or classify tissue samples into different classes of disease. The challenges posed by microarray classification are the limited number of samples in comparison to the high dimensionality of each sample, along with experimental variations in measured gene expression levels. In general, of the thousands of genes investigated, only a relatively small number show a strong correlation with the phenotype in question. Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial to the classification process.

Recently, many gene expression data classification and gene selection techniques have been introduced. Kim et al. [1] proposed a novel method based on an evolutionary algorithm (EA) to assemble optimal classifiers and improve feature selection. Tang et al. [2] used an approach that selects multiple highly informative gene subsets. Wang et al. [3] proposed a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and neighborhood rough set models for gene selection. Shen et al. [4] proposed a modified particle swarm optimization algorithm that allows for the simultaneous selection of genes and samples. Xie et al. [5] developed a diagnosis model based on support vector machines (SVM) with a novel hybrid feature selection method for the diagnosis of erythemato-squamous diseases. Li et al. [6] proposed an algorithm that embeds a locally linear discriminant to map the microarray data to a low-dimensional space, while Huang et al. [7] proposed an improved decision forest for the classification of gene expression data that incorporates a built-in feature selection mechanism for fine-tuning.

In summary, the above feature selection methods can be divided into three common models: filter methods, wrapper methods, and embedded methods. The filter approach evaluates the data before the actual classification process takes place, calculating feature weight values so that features which accurately represent the original data set can be identified. However, a filter approach does not account for interactions amongst the features. Methods in the filter category include correlation-based feature selection (CFS) [9], the t-test, information gain [10], mutual information [11], and entropy-based methods [12]. Wrapper models, on the other hand, focus on improving the classification accuracy of pattern classification problems and typically perform better (i.e., reach higher classification accuracy) than filter models; however, they are more computationally expensive [13], [14]. Several methods in this category have previously been used to perform feature selection on training and testing data, such as the genetic algorithm (GA) [15], the branch and bound algorithm [16], the sequential search algorithm [17], tabu search [18], [19], binary particle swarm optimization [20], [21], and the hybrid genetic algorithm [22]. Embedded techniques use an inductive algorithm that acts as both the feature selector and the classifier, searching for an optimal subset of features that is built into the classifier; examples include the classification trees ID3 and C4.5, and random forest. The advantage of embedded algorithms is that they take the interaction with the classifier into account; their disadvantage is that they are generally based on a greedy mechanism, i.e., they only use top-ranked attributes to perform sample classification [8], [23].
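To make the filter/wrapper distinction concrete, here is a minimal sketch, assuming scikit-learn is available and using a hypothetical random expression matrix: the filter pass ranks each gene independently by mutual information with the class label, while the wrapper pass scores a candidate subset by the cross-validated accuracy of a classifier trained on it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical expression matrix: 40 samples x 200 genes, binary class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = rng.integers(0, 2, size=40)

# Filter: score each gene independently of any classifier.
mi = mutual_info_classif(X, y, random_state=0)
top_filter = np.argsort(mi)[::-1][:10]          # ten highest-scoring genes

# Wrapper: score a candidate subset by the accuracy a classifier reaches on it.
def wrapper_score(subset):
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

print("filter top-10 genes:", top_filter)
print("wrapper score of that subset:", wrapper_score(top_filter))
```

The difference in cost is visible even at this scale: the filter pass scores all 200 genes once, whereas every subset the wrapper considers triggers a full round of classifier training and evaluation.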

Many feature selection methods are combined with a local search process to improve accuracy. One example is presented by Oh et al. [22], who used a local search mechanism within their genetic algorithm. In this paper, we used the Taguchi method as a local search method embedded in the GA. The Taguchi method uses ideas from statistical experimental design to improve and optimize products, processes, or equipment. Its two main tools are: (a) the signal-to-noise ratio (SNR), which measures quality, and (b) orthogonal arrays (OAs), which are used to study many design parameters simultaneously. The Taguchi method is a robust design approach [24]. It has been successfully applied in machine learning and data mining, e.g., in combining data mining and electrical discharge machining [20]. Sohn and Shin used the Taguchi experimental design for the Monte Carlo simulation of classifier combination methods [25]. Kwak and Choi used the Taguchi method to select features for classification problems [26]. Chen et al. optimized neural network parameters with the Taguchi method [27].
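As an illustration of the two Taguchi tools just named, the sketch below evaluates three two-level design factors with an L4(2^3) orthogonal array and a larger-is-better SNR. The response function and its factor effects are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np

# L4(2^3) orthogonal array: 4 runs cover 3 two-level factors in a balanced way
# (each level of each factor appears equally often across the runs).
L4 = np.array([[0, 0, 0],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]])

rng = np.random.default_rng(0)

def response(levels):
    # Hypothetical noisy quality measurement for one factor-level combination,
    # repeated three times to expose variation.
    base = 5.0 + 2.0 * levels[0] + 1.0 * levels[1] - 0.5 * levels[2]
    return base + rng.normal(0.0, 0.1, size=3)

def snr_larger_is_better(y):
    # Taguchi larger-is-better SNR: -10 * log10( mean(1 / y^2) ).
    return -10.0 * np.log10(np.mean(1.0 / np.asarray(y) ** 2))

snr = np.array([snr_larger_is_better(response(run)) for run in L4])

# Main effect of each factor: mean SNR at level 1 minus mean SNR at level 0;
# the sign of the effect tells us which level to keep.
for f in range(3):
    effect = snr[L4[:, f] == 1].mean() - snr[L4[:, f] == 0].mean()
    print(f"factor {f}: pick level {int(effect > 0)} (effect {effect:+.3f})")
```

The point of the OA is that four runs, rather than all eight level combinations, suffice to estimate each factor's main effect.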

A hybrid feature selection approach consisting of two stages is presented in this study. In the first stage, a filter approach calculates a correlation-based weight for each feature, thus identifying relevant features. In the second stage, which constitutes a wrapper approach, the previously identified relevant feature subsets are tested by a Taguchi-genetic algorithm (TGA), which tries to determine optimal feature subsets. These feature subsets are then evaluated with the K-nearest neighbor method (KNN) [28], [29] under leave-one-out cross-validation (LOOCV) [30], [31] based on Euclidean distance calculations. Genetic algorithms [32], [33] provide a randomized global search over the entire search space, with the crossover and mutation operations helping the search procedure escape from sub-optimal solutions [14]. In each iteration of the proposed nature-inspired method, the Taguchi method [24], [34], [35] is applied to explore better feature subsets (solutions) that differ somewhat from the current candidate feature subsets; in other words, the Taguchi method performs a local search in the search space. Experimental results show that the proposed method achieved higher classification accuracy rates than the other methods from the literature it was compared to.
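A minimal sketch of the fitness evaluation at the core of the wrapper stage, assuming scikit-learn and placeholder data: a candidate feature subset is encoded as a binary mask (one GA chromosome), and its fitness is the LOOCV accuracy of a Euclidean-distance KNN restricted to the selected genes. The Taguchi-guided crossover and mutation steps are omitted here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_fitness(mask, X, y, k=3):
    """LOOCV accuracy of a Euclidean KNN on the genes selected by a binary mask."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 0.0                       # an empty subset classifies nothing
    clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    return cross_val_score(clf, X[:, selected], y, cv=LeaveOneOut()).mean()

# Placeholder data: 30 samples x 50 pre-filtered genes, two classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 50))
y = rng.integers(0, 2, size=30)

# One GA individual: an include/exclude bit per gene.
chromosome = rng.integers(0, 2, size=50)
print("fitness (LOOCV accuracy):", loocv_fitness(chromosome, X, y))
```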

Section snippets

Correlation-based feature selection (CFS)

CFS, developed by Hall in 1999 [9], is a simple filter feature selection method that ranks feature subsets based on a correlation-based heuristic evaluation. The method is based on the following hypothesis:

Good feature subsets contain features highly correlated with (i.e., predictive of) the class, yet uncorrelated with (i.e., not predictive of) each other [9].

This hypothesis is incorporated into the correlation-based heuristic evaluation equation

$$\mathrm{Merit}_S = \frac{k\,\overline{r}_{cf}}{\sqrt{k + k(k-1)\,\overline{r}_{ff}}}$$

where $\mathrm{Merit}_S$ is the heuristic merit of a feature subset $S$ containing $k$ features, $\overline{r}_{cf}$ is the mean feature-class correlation, and $\overline{r}_{ff}$ is the mean feature-feature inter-correlation over all pairs of features in $S$.
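The merit can be computed directly from pairwise correlations. The sketch below uses Pearson correlation as the association measure for simplicity; note that Hall's CFS actually measures association with symmetrical uncertainty over discretized features, so Pearson here is a simplifying assumption.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k*rcf / sqrt(k + k(k-1)*rff)."""
    k = len(subset)
    Xs = X[:, subset]
    # Mean absolute feature-class correlation (r_cf).
    rcf = np.mean([abs(np.corrcoef(Xs[:, i], y)[0, 1]) for i in range(k)])
    # Mean absolute feature-feature inter-correlation (r_ff).
    if k > 1:
        C = np.abs(np.corrcoef(Xs, rowvar=False))
        rff = (C.sum() - k) / (k * (k - 1))   # mean of off-diagonal entries
    else:
        rff = 0.0
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

# Usage on placeholder data: 40 samples x 100 genes, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = rng.integers(0, 2, size=40).astype(float)
print(cfs_merit(X, y, subset=[0, 5, 9]))
```

The formula captures the hypothesis numerically: the numerator rewards class-relevant features, while the denominator penalizes subsets whose features are redundant with one another.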

Data description

Due to the large number of genes and the small sample size of gene expression data, many researchers are currently studying how to select genes effectively before applying a classification method, in order to decrease the predictive error rate. In general, gene selection is based on two aspects: one is to obtain a set of genes that have similar functions and a close relationship; the other is to find the smallest set of genes that can provide meaningful diagnostic information for disease prediction without …

Discussion

The performance of various classifiers for microarray data has been discussed. Each classifier has its advantages and disadvantages, so no single one can be considered ideal. As a classifier, KNN performs well for cancer classification when compared to more sophisticated classifiers. KNN is an easily implemented method with a single parameter, the number of nearest neighbors, that needs to be predefined once the distance metric is fixed to Euclidean [44]. In order to enhance the …
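Since the neighbor count is KNN's only tunable parameter once the metric is fixed to Euclidean, it can itself be chosen by the same LOOCV loop. A brief sketch, again assuming scikit-learn and placeholder data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 20))      # placeholder: 30 samples x 20 selected genes
y = rng.integers(0, 2, size=30)

# Evaluate candidate neighbor counts with LOOCV and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
                             X, y, cv=LeaveOneOut()).mean()
          for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```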

Conclusions

Classification problems associated with microarray data analysis constitute a very important research area in bioinformatics. In this paper, a filter (CFS) and a wrapper (TGA) feature selection method were merged into a new hybrid method, and KNN with LOOCV served as a classifier for 11 classification profiles. Experimental results show that this method effectively simplifies feature selection by reducing the total number of features needed. The classification accuracy obtained by the proposed method was higher in ten of the eleven gene expression data set test problems when compared to other classification methods from the literature.

Conflict of interest statement

None declared.

Acknowledgement

This work is partly supported by the National Science Council in Taiwan under Grants NSC96-2622-E-151-019-CC3, NSC96-2622-E-214-004-CC3, NSC95-2221-E-151-004-MY3, NSC95-2221-E-214-087, NSC95-2622-E-214-004, and NSC94-2622-E-151-025-CC3.

References (48)

  • G.C. Cawley et al.

Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers

    Pattern Recognition

    (2003)
  • E. Elbeltagi et al.

    Comparison among five evolutionary-based optimization algorithms

    Advanced Engineering Informatics

    (2005)
  • A.K. Ghosh

    On optimum choice of k in nearest neighbor classification

    Computational Statistics & Data Analysis

    (2006)
  • K. Deb et al.

    Reliable classification of two-class cancer data using evolutionary algorithms

    Biosystems

    (2003)
  • Y. Wang et al.

    Gene selection from microarray data for cancer classification—a machine learning approach

    Computational Biology and Chemistry

    (2005)
  • I. Inza et al.

    Filter versus wrapper gene selection approaches in DNA microarray domains

    Artificial Intelligence in Medicine

    (2004)
  • K.-J. Kim et al.

    An evolutionary algorithm approach to optimal ensemble classifiers for DNA microarray data analysis

    IEEE Transactions on Evolutionary Computation

    (2008)
  • Y. Tang et al.

    Recursive fuzzy granulation for gene subsets extraction and cancer classification

    IEEE Transactions on Information Technology in Biomedicine

    (2008)
  • J. Xie, W. Xie, C. Wang, X. Gao, A novel hybrid feature selection method based on IFSFFS and SVM for the diagnosis of...
  • Y. Saeys et al.

    A review of feature selection techniques in bioinformatics

    Bioinformatics

    (2007)
  • M.A. Hall, Correlation-based feature subset selection for machine learning, PhD thesis, Department of Computer Science,...
  • J.R. Quinlan

    Induction of decision trees

    Machine Learning

    (1986)
  • R. Battiti

    Using mutual information for selecting features in supervised neural net learning

    IEEE Transactions on Neural Networks

    (1994)
  • X. Liu et al.

    An entropy-based gene selection method for cancer classification using microarray data

    BMC Bioinformatics

    (2005)