Ensemble gene selection for cancer classification
Introduction
Cancer classification, which can help to improve patient care and quality of life, is essential for cancer diagnosis and drug discovery. Accurate prediction of cancer is of great value in guiding treatment and anticipating response to therapy [1]. However, traditional diagnostic methods are mainly based on the morphological and clinical appearance of cancer. Their usefulness is limited because cancers usually result from many environmental factors, and even the same tumor may show different symptoms under different conditions. Thus, it is necessary to bring systematic approaches to the problem of cancer diagnosis and prediction. From the biomedical perspective, each kind of disease is associated with certain genes in tissues, and mutations of these genes may give rise to certain diseases. Fortunately, the advent of the DNA microarray technique, which allows the expression levels of thousands of genes to be measured simultaneously in a single experiment [2], makes accurate prediction of cancer possible and easier. Since it can compare gene expression levels in tissues under different conditions, the microarray technique may bring many advantages to cancer prediction and make the diagnosis more objective, accurate and reliable. In recent years, this method has drawn a great deal of attention from both the biological and engineering fields [3], [4], [5].
A microarray dataset (i.e., gene expression profile) is usually organized as a two-dimensional matrix with n rows and m columns, where the columns represent m genes (or features) and each row is a sample described by the expression levels of the m genes. Cancer classification or prediction refers to the process of constructing a model on the microarray dataset and then using this induced model to distinguish one type of sample from the others. Since the prediction results can help doctors choose a proper treatment for patients, especially when the disease is diagnosed at an early stage, cancer prediction with microarray data is very important in clinical diagnosis and therapy. Many classification or prediction methods have been developed in the machine learning community, and many of them have been applied to cancer classification [3], [5]. However, the unique nature of microarray data poses a great challenge when traditional learning algorithms are used.
Generally, microarray data has very high dimensionality (thousands of genes) and a small number of samples (dozens). Geometrically, if each sample is mapped to a point, this high-dimensional space contains only a few sparse points. In this situation, most existing classification algorithms do not scale well. Usually, only a very small portion of the genes is relevant to cancer prediction and most of the rest are useless. These irrelevant genes not only confuse learning algorithms, but also degrade their performance and efficiency. Moreover, a prediction model induced from irrelevant genes may be prone to over-fitting. The noise introduced by complex experimental procedures makes this even worse. To alleviate this so-called “high-dimensional small-sample” problem, gene selection appears to be an effective and sound solution [3], [6].
The purpose of gene selection (or feature selection) is to identify significant genes, which contribute most to the reliable classification of cancers, and to discard as many irrelevant genes as possible. With noisy genes removed, the performance and efficiency of the classification model are improved and over-fitting is lessened [7]. Furthermore, the biological information hidden within the data may be less obscured, which can help researchers discover the relationship between cancer types and genes. Due to its crucial role in cancer diagnosis, gene selection has been extensively studied in recent years [7], [8]. Several experimental studies have also reported that gene selection can effectively enhance the performance of cancer classification (see, e.g., [9], [10], [11], [12], [13]).
Most popular gene selection methods identify a single gene subset, whose discriminative capability for classification is limited. In fact, given a microarray dataset with a huge number of genes, there are many gene subsets with good discriminative power. To achieve better performance and more useful insights in prediction, a more sophisticated kind of method, called ensemble gene selection, has been introduced to predict cancers by using multiple gene subsets simultaneously. It uses different gene subsets to train classifiers and then integrates these classifiers into an overall outcome through a combination strategy, such as majority voting or weighted voting [14]. Compared with traditional methods, its advantage is that it exploits the complementary information of different gene subsets when making a decision [15]. As a result, the prediction is more reliable and robust, as several experiments in the literature have demonstrated (see, e.g., [16], [17], [18], [19]).
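The combination step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid base classifier, the toy expression matrix, the class labels and all function names are assumptions chosen only to keep the example self-contained. Each gene subset trains its own classifier, and the final label is obtained by majority voting (a weighted variant is included for comparison).

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def weighted_vote(predictions, weights):
    """Weighted voting: each classifier's vote counts with its weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def fit_centroids(X, y, genes):
    """Per-class mean expression, restricted to the columns in `genes`."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(genes))
        for k, g in enumerate(genes):
            acc[k] += row[g]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row, genes):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((row[g] - center[k]) ** 2 for k, g in enumerate(genes))
    return min(centroids, key=lambda label: dist(centroids[label]))


def ensemble_predict(X, y, gene_subsets, sample):
    """Train one base classifier per gene subset and combine by majority vote."""
    votes = []
    for genes in gene_subsets:
        centroids = fit_centroids(X, y, genes)
        votes.append(predict_centroid(centroids, sample, genes))
    return majority_vote(votes)


# Toy data (hypothetical): 4 samples x 3 genes.
X = [[1.0, 0.2, 5.0], [0.9, 0.1, 4.0], [0.1, 0.9, 1.0], [0.2, 1.0, 2.0]]
y = ["tumor", "tumor", "normal", "normal"]
sample = [0.95, 0.15, 0.9]

# Subsets [0] and [1] vote "tumor"; subset [2] votes "normal" and is outvoted.
print(ensemble_predict(X, y, [[0], [1], [2]], sample))
```

In this toy case the third subset disagrees with the other two on the new sample, and the vote absorbs that disagreement, which is the kind of robustness the ensemble is meant to provide.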
In this study, we propose a new ensemble method of gene selection based on information theory. Unlike other ensemble methods, where gene subsets are generated randomly or by different gene selection algorithms, our method, called EGS, obtains multiple gene subsets by applying the same selection technique from different starting points in its search procedure. Using the same selection technique implies that each obtained gene subset has good discriminative capability, while different starting points ensure that each subset carries its own information and help the search avoid being trapped in local optima. That is to say, the diversity of our ensemble method lies in the gene subsets, not in the specific selection methods. Furthermore, to obtain gene subsets with more information, two non-parametric and symmetric measurements, i.e., conditional mutual information and its normalized form, are taken as the evaluation criteria in our gene selection method. They are chosen because they can effectively measure the relevance between genes and cancer types. More importantly, the number of selected genes in the proposed method is determined self-adaptively. To some extent, this reduces the burden of tuning the many parameters that most ensemble methods require.
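The idea of running one selector from several starting points can be sketched as follows. This is a hedged illustration rather than the EGS algorithm itself: the greedy forward search, the stopping rule (stop when no remaining gene adds more than `min_gain` bits), the choice of starting genes, and all names are assumptions, and gene expression values are assumed to be already discretized. The snippet does not specify any of these details.

```python
import math
from collections import Counter


def entropy(*columns):
    """Empirical joint Shannon entropy (bits) of zero or more discrete columns."""
    if not columns:
        return 0.0
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in Counter(rows).values())


def cmi(c, g, selected):
    """I(C; g | GS), computed from joint entropies."""
    return (entropy(c, *selected) + entropy(g, *selected)
            - entropy(c, g, *selected) - entropy(*selected))


def select_from(genes, c, start, min_gain=1e-9):
    """Greedy forward selection seeded with gene index `start`.

    Assumed stopping rule: stop once no remaining gene adds more than
    `min_gain` bits about the class labels given the genes chosen so far.
    """
    selected = [start]
    remaining = set(range(len(genes))) - {start}
    while remaining:
        chosen = [genes[i] for i in selected]
        gains = {i: cmi(c, genes[i], chosen) for i in remaining}
        best = max(sorted(gains), key=gains.get)  # lowest index wins ties
        if gains[best] <= min_gain:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def ensemble_subsets(genes, c, starts):
    """Same selector, different starting genes: one subset per start."""
    return [select_from(genes, c, s) for s in starts]


# Toy data (hypothetical): each gene is a column of discretized values.
c = [0, 0, 0, 0, 1, 1, 1, 1]               # class labels
genes = [
    [0, 0, 0, 0, 1, 1, 1, 1],              # gene 0: identical to the class
    [0, 0, 1, 1, 0, 0, 1, 1],              # gene 1: uninformative alone
    [0, 0, 1, 1, 1, 1, 0, 0],              # gene 2: class XOR gene 1
    [0, 0, 0, 0, 0, 0, 0, 0],              # gene 3: constant
]

print(ensemble_subsets(genes, c, [0, 1, 2]))
```

Because the subset size is decided by the stopping rule rather than fixed in advance, different starts can yield subsets of different lengths, which is one way to read "self-adaptive" here.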
The rest of this paper is organized as follows. Section 2 gives some concepts about information theory and gene selection. In Section 3, the state of the art of gene selection methods is briefly presented. Section 4 first introduces a gene selection algorithm based on a new evaluation criterion and then proposes a framework of ensemble gene selection. Experimental results of the proposed ensemble method on six microarray datasets are shown and discussed in Section 5. Finally, conclusions and future work are given in the last section.
Mutual information
Entropy is an elementary concept in information theory [20]. Unlike other measurements, e.g., the correlation coefficient and Euclidean distance, information entropy provides an intuitive way to quantify the uncertainty of a random variable. Let X and dom(X) be a discrete random variable and its domain (or alphabet). The information amount of variable X is represented as the Shannon entropy H(X), and $H(X) = -\sum_{x \in \mathrm{dom}(X)} p(x) \log p(x)$, where p(x) is the marginal probability distribution of X. From this
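As a small illustration of the definition of H(X), the following sketch estimates Shannon entropy from observed values using relative frequencies as p(x). The base-2 logarithm (entropy in bits) and the function name are assumptions; the text above does not fix the base, and continuous expression values would first need to be discretized.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Empirical Shannon entropy of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log(k / n, base) for k in counts.values())


# A variable with two equally likely values carries one bit of uncertainty,
# while a constant variable carries none.
print(shannon_entropy([0, 1, 0, 1]))
print(shannon_entropy([1, 1, 1, 1]))
```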
Related work
Since gene selection brings many advantages to microarray data analysis, numerous gene selection algorithms have been proposed in recent years. This section briefly reviews the state of the art of gene selection, so as to provide deeper insight into this problem. Interested readers can refer to good surveys (see, e.g., [7], [8], [30], [24]) for more information.
As mentioned above, the evaluation criterion J(GS) is mainly used to measure the goodness of candidate genes GS.
Gene selection
From the definition of conditional mutual information, we know that $I(C; g \mid GS)$ can effectively measure the amount of information shared by the gene g and the disease types C that has not been captured by the already selected genes GS. Moreover, according to Eqs. (2), (3), the following equation holds: $I(C; GS \cup \{g\}) = I(C; GS) + I(C; g \mid GS)$. This indicates that $I(C; g \mid GS)$ is the incremental amount of information of selected genes about C. That is to say, this incremental mode of information in
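The role of conditional mutual information as an increment can be checked numerically. The sketch below is illustrative only: it uses the standard identity I(C; g | GS) = H(C, GS) + H(g, GS) - H(C, g, GS) - H(GS) with base-2 logarithms, and the toy columns and function names are assumptions. In the example, the class is the XOR of two genes, so neither gene is informative alone, yet the second gene adds one full bit once the first is selected.

```python
import math
from collections import Counter


def joint_entropy(*columns):
    """Empirical joint Shannon entropy (bits) of zero or more discrete columns."""
    if not columns:
        return 0.0
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in Counter(rows).values())


def mutual_information(c, gene_columns):
    """I(C; GS) = H(C) + H(GS) - H(C, GS)."""
    return (joint_entropy(c) + joint_entropy(*gene_columns)
            - joint_entropy(c, *gene_columns))


def conditional_mutual_information(c, g, selected):
    """I(C; g | GS) = H(C, GS) + H(g, GS) - H(C, g, GS) - H(GS)."""
    return (joint_entropy(c, *selected) + joint_entropy(g, *selected)
            - joint_entropy(c, g, *selected) - joint_entropy(*selected))


# Toy columns (hypothetical): the class is the XOR of the two genes.
g1 = [0, 0, 1, 1]
g2 = [0, 1, 0, 1]
C = [0, 1, 1, 0]

before = mutual_information(C, [g1])                 # information already held by GS = {g1}
gain = conditional_mutual_information(C, g2, [g1])   # increment contributed by g2
after = mutual_information(C, [g1, g2])              # information held by GS plus g2

print(before, gain, after)
```

The three printed values satisfy after = before + gain, which is the chain rule that justifies selecting, at each step, the gene with the largest conditional mutual information.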
Experiments and discussions
This section presents the experimental results and analysis of EGS on six public microarray datasets with high dimensionality/small sample size. At the beginning, the datasets and several gene selection algorithms used in this analysis are briefly described. Subsequently, experimental results are given and discussed from two different aspects.
Conclusions
In this paper, we proposed a new ensemble gene selection method, where each gene subset is obtained by the same gene selector from a different starting point. In this algorithm, genes are sequentially selected according to conditional mutual information or its normalized form. As a result, each obtained gene subset has good discriminative capability for classification. Moreover, the number of selected genes in the proposed method is determined self-adaptively. To increase the diversity of ensemble
Acknowledgements
The authors are grateful to anonymous referees for their valuable and constructive comments. This work is supported by the National NSF of China (60873044).
About the Author—HUAWEN LIU received his B.S. degree in computer science from Jiangxi Normal University, in 1999, and M.S. degree in computer science from Jilin University, PR China, in 2007. At present, he is a Ph.D. candidate in Jilin University. His research interests involve data mining, machine learning, pattern recognition and rough set.
References (57)
- et al. Cancer classification using gene expression data. Information Systems (2003)
- et al. Ensemble classifiers based on correlation analysis for DNA microarray classification. Neurocomputing (2006)
- et al. The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming. Artificial Intelligence in Medicine (2006)
- An integrated algorithm for gene selection and classification applied to microarray data of ovarian cancer. Artificial Intelligence in Medicine (2008)
- et al. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine (2007)
- et al. Feature selection with dynamic mutual information. Pattern Recognition (2009)
- et al. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition (2007)
- et al. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition (2006)
- et al. Non-parametric classifier-independent feature selection. Pattern Recognition (2006)
- et al. Classifier ensembles: select real-world applications. Information Fusion (2008)
- Diversity in search strategies for ensemble feature selection. Information Fusion
- Gene expression correlates of clinical prostate cancer behavior. Cancer Cell
- Selecting differentially expressed genes using minimum probability of classification error. Journal of Biomedical Informatics
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science
- Evaluating microarray-based classifiers: an overview. Cancer Informatics
- Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute
- Machine learning in bioinformatics. Briefings in Bioinformatics
- Gene-set approach for expression pattern analysis. Briefings in Bioinformatics
- A review of feature selection techniques in bioinformatics. Bioinformatics
- Approaches to dimensionality reduction in proteomic biomarker studies. Briefings in Bioinformatics
- Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Analysis and Applications
- A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics
- Conditional variable importance for random forests. BMC Bioinformatics
- Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics
- MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics
- Cancer classification using ensemble of neural networks with multiple significant gene subsets. Applied Intelligence
- Elements of Information Theory
About the Author—LEI LIU received his B.S. and M.S. degrees in computer science from Jilin University, in 1982 and 1985, respectively. Then he joined College of Computer Science and Technology of Jilin University as a lecturer in the same year. Currently, he is a professor and the director of Software Formalization Lab in Jilin University. He has wide research interests, mainly including programming theory, semantic web, computational language, pattern recognition and data mining.
About the Author—HUIJIE ZHANG received her B.Sc., M.Sc. and Ph.D. degrees in computer science from Jilin University, in 1998, 2004 and 2008, respectively. Currently, she is a lecturer in the Department of Computer Science, Northeast Normal University, PR China. Her research areas include Geographical Information System (GIS), data mining and pattern recognition.