A hybrid feature selection method for DNA microarray data
Introduction
Microarray data can provide valuable results for a variety of gene expression profile problems and contribute to advances in clinical medicine. The application of microarray data to cancer type classification has recently gained popularity. Coupled with statistical techniques, gene expression patterns have been used to screen potential tumor markers. Differential expressions of genes are analyzed statistically and each gene expression is assigned to a certain category. The classification of gene expressions can substantially enhance the understanding of the underlying biological processes.
The goal of microarray data classification is to build an efficient and effective model that can differentiate the gene expressions of samples, i.e., determine normal or abnormal states, or classify tissue samples into different classes of diseases. The challenges posed in microarray classification are the small number of samples in comparison to the high dimensionality of each sample, along with experimental variations in measured gene expression levels. In general, only a relatively small fraction of the thousands of genes investigated shows a strong correlation with the phenotype in question. Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process.
Recently, many gene expression data classification and gene selection techniques have been introduced. Kim et al. [1] proposed a novel method based on an evolutionary algorithm (EA) to assemble optimal classifiers and improve feature selection. Tang et al. [2] used an approach that selects multiple highly informative gene subsets. Wang et al. [3] proposed a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and neighborhood rough set models based on gene selection. Shen et al. [4] proposed a modified particle swarm optimization that allows for the simultaneous selection of genes and samples. Xie et al. [5] developed a diagnosis model based on support vector machines (SVM) with a novel hybrid feature selection method for diagnosis of erythemato-squamous diseases. Li et al. [6] proposed an algorithm with a locally linear discriminant embedded in it to map the microarray data to a low dimensional space, while Huang et al. [7] proposed an improved decision forest for the classification of gene expression data that incorporates a built-in feature selection mechanism for fine-tuning.
In summary, the above feature selection methods fall into three common models: filter methods, wrapper methods, and embedded methods. The filter approach evaluates the data before the actual classification process takes place, calculating a weight value for each feature so that features which accurately represent the original data set can be identified. However, a filter approach does not account for interactions amongst the features. Methods in this category include correlation-based feature selection (CFS) [9], the t-test, information gain [10], mutual information [11], and entropy-based methods [12]. Wrapper models, on the other hand, generally focus on improving the classification accuracy of pattern classification problems and typically reach higher classification accuracy than filter models, but at a greater computational cost [13], [14]. Several methods in this category have previously been used to perform feature selection on training and testing data, such as the genetic algorithm (GA) [15], the branch and bound algorithm [16], the sequential search algorithm [17], tabu search [18], [19], binary particle swarm optimization [20], [21], and the hybrid genetic algorithm [22]. Embedded techniques use an inductive algorithm that acts as both the feature selector and the classifier, searching for an optimal subset of features that is built into the classifier; examples include decision-tree learners such as ID3, C4.5, and random forest. The advantage of embedded algorithms is that they take the interaction with the classifier into account; their disadvantage is that they are generally based on a greedy mechanism, i.e., they only use top-ranked attributes to perform sample classification [8], [23].
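As a concrete illustration of the filter idea (a minimal sketch, not one of the cited methods), the following snippet scores each gene independently by its absolute Pearson correlation with the class label. Because each feature is scored in isolation, interactions among features are ignored, which is exactly the weakness of filter approaches noted above. The toy data are assumptions for illustration only.

```python
import numpy as np

def filter_rank(X, y, top_k):
    """Filter-style feature selection: score each feature independently
    by |Pearson correlation| with the class label, then keep the best."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(-scores)[:top_k], scores   # best-scoring first

# Toy data: feature 0 tracks the class, feature 1 is unrelated.
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    [0.10, 0.00, 0.05, 1.00, 0.90, 1.05],     # informative gene
    [1.00, -1.00, 0.50, -0.50, 0.20, -0.20],  # noise gene
])
top, scores = filter_rank(X, y, top_k=1)      # selects feature 0
```

In a real microarray setting the same ranking would be applied to thousands of gene columns, keeping only the top-scoring fraction before any classifier is trained.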
Many feature selection methods are combined with a local search process to improve accuracy. One example is presented by Oh et al. [22], who used a local search mechanism in their genetic algorithm. In this paper, we use the Taguchi method as a local search method embedded in the GA. The Taguchi method uses ideas from statistical experimental design to improve and optimize products, processes or equipment. Its two main tools are: (a) the signal-to-noise ratio (SNR), which measures quality, and (b) orthogonal arrays (OAs), which are used to study many design parameters simultaneously. The Taguchi method is a robust design approach [24] and has been successfully applied in machine learning and data mining, e.g., in combining data mining with electrical discharge machining [20]. Sohn and Shin used Taguchi experimental design for the Monte Carlo simulation of classifier combination methods [25]. Kwak and Choi used the Taguchi method to select features for classification problems [26]. Chen et al. optimized neural network parameters with the Taguchi method [27].
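The orthogonal-array mechanics can be sketched as follows (a simplified two-level illustration, not the authors' exact TGA). Each bit on which two parent feature subsets differ is treated as a factor with two levels: take the bit from parent A or from parent B. The standard L8(2^7) array prescribes eight trial offspring; per-factor main effects on the trial fitnesses then pick the better level for every factor at once. The `fitness` function here is a stand-in supplied by the caller.

```python
import numpy as np

# Standard two-level orthogonal array L8(2^7): 8 trials cover up to 7 factors,
# with every pair of factor levels balanced across the trials.
OA_L8 = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
])

def taguchi_combine(parent_a, parent_b, fitness):
    """Compose a child bit vector from two parent feature subsets.

    Each differing bit is a 'factor'; level 0 keeps parent_a's bit,
    level 1 takes parent_b's.  The OA prescribes 8 trial children, and
    per-factor main effects on their fitnesses select the better level
    for every factor simultaneously."""
    a, b = np.asarray(parent_a), np.asarray(parent_b)
    factors = np.flatnonzero(a != b)[:7]      # L8 handles at most 7 factors
    trials = OA_L8[:, :len(factors)]
    scores = np.empty(len(trials))
    for i, row in enumerate(trials):
        trial = a.copy()
        trial[factors[row == 1]] = b[factors[row == 1]]
        scores[i] = fitness(trial)
    child = a.copy()
    for j, f in enumerate(factors):
        # Larger-is-better main effect: mean fitness with b's bit vs a's.
        if scores[trials[:, j] == 1].mean() > scores[trials[:, j] == 0].mean():
            child[f] = b[f]
    return child
```

Because the array columns are balanced, eight fitness evaluations estimate the effect of all seven factors, instead of the 2^7 evaluations an exhaustive search would need.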
A hybrid feature selection approach consisting of two stages is presented in this study. In the first stage, a filter approach calculates a correlation-based feature weight for each feature, thus identifying relevant features. In the second stage, which constitutes a wrapper approach, the previously identified relevant feature subsets are tested by a Taguchi-genetic algorithm (TGA) that searches for optimal feature subsets. These feature subsets are then appraised with the K-nearest neighbor method (KNN) [28], [29] under leave-one-out cross-validation (LOOCV) [30], [31] based on Euclidean distance calculations. Genetic algorithms [32], [33] use randomness for a global search over the entire search space, with the crossover and mutation operations helping the search procedure escape from sub-optimal solutions [14]. In each iteration of the proposed nature-inspired method, the Taguchi method [24], [34], [35] is applied to explore better feature subsets (or solutions) that differ somewhat from those in the candidate feature subsets; in other words, the Taguchi method performs a local search in the search space. Experimental results show that the proposed method achieved higher classification accuracy rates than the other methods from the literature it was compared to.
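The wrapper side's fitness evaluation can be sketched as follows: a candidate feature subset is scored by the LOOCV accuracy of a KNN classifier (Euclidean distance) restricted to that subset. This is a minimal illustration of the evaluation step only, not the full TGA; the toy data and the value of k are assumptions.

```python
import numpy as np

def loocv_knn_accuracy(X, y, feature_mask, k=3):
    """Leave-one-out cross-validation accuracy of KNN restricted to the
    features selected by feature_mask -- the score a wrapper method
    assigns to a candidate feature subset."""
    Xs = np.asarray(X, dtype=float)[:, np.asarray(feature_mask, bool)]
    y = np.asarray(y)
    correct = 0
    for i in range(len(y)):                       # hold out sample i
        d = np.linalg.norm(Xs - Xs[i], axis=1)    # Euclidean distances
        d[i] = np.inf                             # never its own neighbour
        votes = y[np.argsort(d)[:k]].tolist()
        pred = max(set(votes), key=votes.count)   # majority vote
        correct += pred == y[i]
    return correct / len(y)

# Toy data: only feature 0 separates the two classes.
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    [0.0, 0.1, 0.2, 1.0, 1.1, 1.2],   # informative gene
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],   # noise gene
])
acc = loocv_knn_accuracy(X, y, [True, False], k=3)   # -> 1.0
```

In the wrapper loop, each chromosome's bit vector plays the role of `feature_mask`, so the GA directly optimizes the cross-validated accuracy of the classifier it will ultimately use.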
Correlation-based feature selection (CFS)
CFS, developed by Hall in 1999 [9], is a simple filter feature selection method that ranks feature subsets based on a correlation-based heuristic evaluation. It is based on the following hypothesis:
Good feature subsets contain features highly correlated with (i.e., predictive of) the class, yet uncorrelated with (i.e., not predictive of) each other [9].
This hypothesis is incorporated into the correlation-based heuristic evaluation equation

Merit_S = k * r_cf / sqrt(k + k(k - 1) * r_ff),

where S is a subset containing k features, r_cf is the mean feature-class correlation of the features in S, and r_ff is their mean feature-feature intercorrelation [9].
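The merit of a candidate subset can be computed as in the following sketch. Pearson correlation stands in here for Hall's symmetrical-uncertainty measure, and the toy data are assumptions; the point is that the numerator rewards class-predictive features while the denominator penalizes redundancy among them.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset S with k features:
        k * r_cf / sqrt(k + k*(k-1) * r_ff)
    r_cf: mean |feature-class correlation|; r_ff: mean |feature-feature
    correlation|.  High merit = predictive yet non-redundant features."""
    Xs = np.asarray(X, dtype=float)[:, list(subset)]
    y = np.asarray(y, dtype=float)
    k = Xs.shape[1]
    r_cf = np.mean([abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(k)])
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(Xs[:, i], Xs[:, j])[0, 1])
                        for i in range(k) for j in range(i + 1, k)])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Toy data: feature 0 tracks the class, feature 1 does not.
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    [0.10, 0.00, 0.05, 1.00, 0.90, 1.05],
    [1.00, -1.00, 0.50, -0.50, 0.20, -0.20],
])
```

Here the relevant feature alone scores higher than either the irrelevant feature or the pair, so a search over subsets guided by this merit would keep only feature 0.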
Data description
Due to the large number of genes and the small sample size of gene expression data, many researchers are currently studying how to select genes effectively before using a classification method to decrease the predictive error rate. In general, gene selection is based on two aspects: one is to obtain a set of genes that have similar functions and a close relationship, the other is to find the smallest set of genes that can provide meaningful diagnostic information for disease prediction without
Discussion
The performances of various classifiers for microarray data have been discussed. Each classifier has its advantages and disadvantages, so no single one can be considered ideal. As a classifier, KNN performs well for cancer classification when compared to the more sophisticated classifiers. KNN is an easily implemented method that has a simple parameter (the number of nearest neighbors) that needs to be predefined, given that the distance metric is Euclidean [44]. In order to enhance the
Conclusions
Classification problems associated with microarray data analysis constitute a very important research area in bioinformatics. In this paper, a filter (CFS) and wrapper (TGA) feature selection method were merged into a new hybrid method, and KNN with LOOCV method served as a classifier for 11 classification profiles. Experimental results show that this method effectively simplifies feature selection by reducing the total number of features needed. The classification accuracy obtained by the
Conflict of interest statement
None declared.
Acknowledgement
This work is partly supported by the National Science Council in Taiwan under Grants NSC96-2622-E-151-019-CC3, NSC96-2622-E214-004-CC3, NSC95-2221-E-151-004-MY3, NSC95-2221-E-214-087, NSC95-2622-E-214-004, and NSC94-2622-E-151-025-CC3.
References (48)
- et al., Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction, Computers in Biology and Medicine (2010)
- et al., Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification, Computers in Biology and Medicine (2009)
- et al., Gene expression data classification using locally linear discriminant embedding, Computers in Biology and Medicine (2010)
- et al., Decision forest for classification of gene expression data, Computers in Biology and Medicine (2010)
- et al., Wrappers for feature subset selection, Artificial Intelligence (1997)
- et al., Floating search methods in feature selection, Pattern Recognition Letters (1994)
- et al., Feature selection using tabu search method, Pattern Recognition (2002)
- et al., Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry (2008)
- et al., Experimental study for the comparison of classifier combination methods, Pattern Recognition (2007)
- et al., A neural network-based approach for dynamic quality prediction in a plastic injection molding process, Expert Systems with Applications (2008)
- Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recognition
- Comparison among five evolutionary-based optimization algorithms, Advanced Engineering Informatics
- On optimum choice of k in nearest neighbor classification, Computational Statistics & Data Analysis
- Reliable classification of two-class cancer data using evolutionary algorithms, Biosystems
- Gene selection from microarray data for cancer classification—a machine learning approach, Computational Biology and Chemistry
- Filter versus wrapper gene selection approaches in DNA microarray domains, Artificial Intelligence in Medicine
- An evolutionary algorithm approach to optimal ensemble classifiers for DNA microarray data analysis, IEEE Transactions on Evolutionary Computation
- Recursive fuzzy granulation for gene subsets extraction and cancer classification, IEEE Transactions on Information Technology in Biomedicine
- A review of feature selection techniques in bioinformatics, Bioinformatics
- Induction of decision trees, Machine Learning
- Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks
- An entropy-based gene selection method for cancer classification using microarray data, BMC Bioinformatics