
Pattern Recognition

Volume 43, Issue 8, August 2010, Pages 2763-2772

Ensemble gene selection for cancer classification

https://doi.org/10.1016/j.patcog.2010.02.008

Abstract

Cancer diagnosis is an important emerging clinical application of microarray data. Accurately predicting the type or size of a tumor relies on powerful and reliable classification models, so that patients can be provided with better treatment or response to therapy. However, the high dimensionality of microarray data may bring disadvantages, such as over-fitting, poor performance and low efficiency, to traditional classification models. Thus, one of the challenging tasks in cancer diagnosis is to identify salient expression genes, among the thousands in microarray data, that directly contribute to the phenotype or symptoms of a disease. In this paper, we propose a new ensemble gene selection method (EGS) that chooses multiple gene subsets for classification, where the significance of a gene is measured by conditional mutual information or its normalized form. After different gene subsets have been obtained by setting different starting points for the search procedure, they are used to train multiple base classifiers, which are then aggregated into a consensus classifier by majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets, and the comparison results show that our method works well.

Introduction

Cancer classification, which can help to improve patient health care and individuals' quality of life, is essential for cancer diagnosis and drug discovery. An accurate prediction of cancer has great value in providing better treatment and response to therapy from different aspects [1]. However, traditional diagnostic methods are mainly based on the morphological and clinical appearance of cancer. Their contributions are limited because cancers usually result from many environmental factors, and even the same tumor may present different symptoms under different conditions. Thus, it is necessary to bring systematic approaches to the problem of cancer diagnosis and prediction. From the biomedical perspective, each kind of disease is associated with certain genes in tissues, and the mutation of genes may give rise to the occurrence of certain diseases. Fortunately, the advent of the DNA microarray technique, which allows the expression levels of thousands of genes to be measured simultaneously in a single experiment [2], makes accurate prediction of cancer possible and easier. Since it can compare gene expression levels in tissues under different conditions, the microarray technique may bring many advantages to cancer prediction and make diagnosis more objective, accurate and reliable. Over the past years, it has drawn a great deal of attention from both the biological and engineering fields [3], [4], [5].

A microarray dataset (i.e., a gene expression profile) is usually organized as a two-dimensional matrix M = (D, G) with n rows and m columns, where the columns represent m genes (or features) G = {g1, …, gm} and each row in D = {s1, …, sn} is a sample represented by the expression levels of the m genes. Cancer classification or prediction refers to constructing a model on the microarray dataset and then distinguishing one type of sample from the others with this induced model. Since the prediction results can help doctors choose a proper treatment for patients, especially when the disease is diagnosed at an early stage, cancer prediction with microarray data is very important in clinical diagnosis and therapy. By now, many classification or prediction methods have been developed in the machine learning community, and many have been applied to cancer classification [3], [5]. However, the unique nature of microarray data raises a great challenge when traditional learning algorithms are used.
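The n-samples-by-m-genes layout described above can be sketched with a toy matrix; the sizes and values below are illustrative only (real datasets have thousands of genes and only dozens of samples):

```python
import numpy as np

# Toy microarray matrix M = (D, G): rows are samples, columns are genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 6, 10                 # illustrative sizes only
D = rng.normal(size=(n_samples, n_genes))  # simulated expression levels
labels = np.array([0, 0, 0, 1, 1, 1])      # hypothetical cancer type per sample

assert D.shape == (n_samples, n_genes)
```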

Generally, the microarray data M has very high dimensionality (in the thousands) and a small number of samples (in the dozens). From the view of geometrical space, this high-dimensional space G contains only a few sparse points if each sample is mapped to a point in the space. Faced with this situation, most existing classification algorithms do not scale well. Usually, only a very small portion of the genes in G is relevant to cancer prediction and most of them are useless. These irrelevant genes not only confuse learning algorithms, but also degrade their performance and efficiency. Moreover, a prediction model induced from irrelevant genes may be prone to over-fitting. The presence of noise arising from complex scientific procedures makes this even worse. To alleviate this so-called “high-dimensional small-sample” problem, gene selection seems to be an effective and sound solution [3], [6].

The purpose of gene selection (or feature selection) is to identify significant genes, which contribute most to the reliable classification of cancers, and to discard as many irrelevant genes as possible. With noisy genes removed, the performance and efficiency of the classification model will improve and over-fitting will be lessened [7]. Furthermore, the biological information hidden within the data may be less obscured, which can help researchers discover the relationships between cancer types and genes. Due to its crucial role in cancer diagnosis, gene selection has been extensively studied in past years [7], [8]. Several studies have also reported that gene selection can effectively enhance the performance of cancer classification (see, e.g., [9], [10], [11], [12], [13]).

Usually, popular gene selection methods identify a single gene subset, whose discriminative capability is limited for classification purposes. As a matter of fact, given a microarray dataset with a huge number of genes, there are many gene subsets with good discriminative power. To achieve better performance and more useful insights in prediction, a more sophisticated kind of method, called ensemble gene selection, has been introduced to predict cancers by manipulating multiple gene subsets simultaneously. It uses different gene subsets to train classifiers and then integrates these classifiers into an overall outcome by some combination strategy, such as majority voting or weighted voting [14]. Compared with traditional methods, its advantage is that it exploits the complementary information of different gene subsets in making decisions [15]. As a result, the prediction result is more reliable and robust, as has been demonstrated by several experiments in the literature (see, e.g., [16], [17], [18], [19]).
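The majority-voting combination step can be sketched minimally as follows, assuming each base classifier has already been trained on its own gene subset; the class labels are illustrative:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions for one sample by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base classifiers vote on a class label for one sample.
votes = ["ALL", "AML", "ALL"]
print(majority_vote(votes))  # -> ALL
```

Weighted voting would replace the raw count with a sum of per-classifier weights, e.g. each classifier's validation accuracy.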

In this study, we propose a new ensemble gene selection method based on information theory. Unlike other ensemble methods, where gene subsets are generated in a random manner or by different gene selection algorithms, our method, called EGS, obtains multiple gene subsets with the same selection technique started from different points in its search procedure. The common selection manner implies that each obtained gene subset has good discriminative capability, while the different starting points guarantee that each subset carries its own information and avoids being trapped in local optima. That is to say, the diversity of our ensemble method lies in the gene subsets, not in specific selection methods. Furthermore, to obtain gene subsets with more information, two non-parametric and symmetric measures, i.e., conditional mutual information and its normalized form, are taken as the evaluation criteria in our gene selection method. They are chosen because they can effectively measure the relevance between genes and cancer types. More importantly, the number of selected genes in the proposed method is determined self-adaptively, which to some extent relieves the burden of tuning the many parameters found in most ensemble methods.
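The outer loop of this idea — the same greedy selector run from several starting points to produce diverse subsets — can be sketched as below. The `select_subset` routine and the choice of starting points are placeholders, not the paper's actual procedure:

```python
def egs(genes, n_subsets, select_subset):
    """Ensemble gene selection sketch: run one selector from several
    different starting genes, yielding one subset per starting point."""
    subsets = []
    for start in genes[:n_subsets]:  # illustrative choice of starting points
        subsets.append(select_subset(start))
    return subsets

# Hypothetical selector: each run keeps its start gene plus one fixed gene.
subs = egs(["g1", "g2", "g3"], 2, lambda s: [s, "g9"])
print(subs)  # -> [['g1', 'g9'], ['g2', 'g9']]
```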

The rest of this paper is organized as follows. Section 2 gives some concepts of information theory and gene selection. In Section 3, the state of the art of gene selection methods is briefly presented. Section 4 first introduces a gene selection algorithm based on a new evaluation criterion and then proposes a framework for ensemble gene selection. Experimental results of the proposed ensemble method on six microarray datasets are shown and discussed in Section 5. Finally, conclusions and future work are given in the last section.

Section snippets

Mutual information

Entropy is an elementary concept in information theory [20]. Unlike other measurements, e.g., correlation coefficient and Euclidean distance, information entropy provides an intuitive method to quantify the uncertainty of a random variable. Let X and dom(X) be a discrete random variable and its domain (or alphabet). The information amount of variable X is represented by its Shannon entropy H(X):

H(X) = −∑_{x ∈ dom(X)} p(x) log p(x),

where p(x) is the marginal probability distribution of X. From this
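The entropy definition above can be checked numerically with a plug-in estimate from observed values; this is a minimal sketch, assuming a base-2 logarithm (so entropy is in bits):

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in Shannon entropy H(X) = -sum_x p(x) log2 p(x) over observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

print(entropy([0, 1, 0, 1]))  # -> 1.0 (a fair binary variable carries one bit)
```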

Related work

Since gene selection can bring many advantages to microarray data analysis, numerous gene selection algorithms have been proposed in past years. This section briefly reviews the state of the art of gene selection, so as to provide a deeper insight into this problem. Interested readers can refer to good surveys (see, e.g., [7], [8], [30], [24]) for more information.

As mentioned above, the evaluation criterion J(GS) is mainly used to measure the goodness of candidate genes GS.

Gene selection

From the definition of conditional mutual information, we know that I(g;C|GS) can effectively measure the amount of information shared by the gene g and the disease types C that has not already been captured by the selected genes GS. Moreover, according to Eqs. (2), (3), the following equation holds:

I(g;C|GS) = I(g,GS;C) − I(GS;C).

This indicates that I(g;C|GS) is the incremental amount of information about C contributed by adding g to the selected genes. That is to say, this incremental mode of information in
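The chain-rule identity above can be verified numerically with plug-in entropy estimates; a minimal sketch on toy data (base-2 logs, all variable names illustrative):

```python
import math
from collections import Counter

def H(*vars_):
    """Joint Shannon entropy (bits) of one or more aligned discrete sequences."""
    n = len(vars_[0])
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(zip(*vars_)).values())

def I(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(x) + H(y) - H(x, y)

def I_cond(x, y, z):
    """Conditional mutual information I(X;Y|Z)."""
    return H(x, z) + H(y, z) - H(z) - H(x, y, z)

# Toy data: candidate gene g, class C, one already-selected gene gs.
g  = [0, 0, 1, 1, 0, 1, 1, 0]
gs = [0, 1, 0, 1, 0, 1, 0, 1]
C  = [0, 1, 1, 1, 0, 1, 1, 0]

# Check I(g; C | gs) = I((g, gs); C) - I(gs; C)
lhs = I_cond(g, C, gs)
rhs = I(list(zip(g, gs)), C) - I(gs, C)
assert abs(lhs - rhs) < 1e-12
```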

Experiments and discussions

This section presents the experimental results and analysis of EGS on six public microarray datasets with high dimensionality/small sample size. At the beginning, the datasets and several gene selection algorithms used in this analysis are briefly described. Subsequently, experimental results are given and discussed from two different aspects.

Conclusions

In this paper, we proposed a new ensemble gene selection method, where each gene subset is obtained by the same gene selector with a different starting point. In this algorithm, genes are sequentially selected according to conditional mutual information or its normalized form. As a result, the obtained gene subset has good discriminative capability for classification. Moreover, the number of selected genes in the proposed method is determined self-adaptively. To increase the diversity of ensemble

Acknowledgements

The authors are grateful to the anonymous referees for their valuable and constructive comments. This work is supported by the National Natural Science Foundation of China (60873044).

About the Author: HUAWEN LIU received his B.S. degree in computer science from Jiangxi Normal University, in 1999, and M.S. degree in computer science from Jilin University, PR China, in 2007. At present, he is a Ph.D. candidate at Jilin University. His research interests involve data mining, machine learning, pattern recognition and rough sets.

References (57)

  • A. Tsymbal et al., Diversity in search strategies for ensemble feature selection, Information Fusion (2005)
  • D. Singh et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
  • P. Mahata et al., Selecting differentially expressed genes using minimum probability of classification error, Journal of Biomedical Informatics (2007)
  • T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • A.L. Boulesteix et al., Evaluating microarray-based classifiers: an overview, Cancer Informatics (2008)
  • A. Dupuy et al., Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting, Journal of the National Cancer Institute (2007)
  • P. Larrañaga et al., Machine learning in bioinformatics, Briefings in Bioinformatics (2006)
  • D. Nam et al., Gene-set approach for expression pattern analysis, Briefings in Bioinformatics (2008)
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • M. Hilario et al., Approaches to dimensionality reduction in proteomic biomarker studies, Briefings in Bioinformatics (2008)
  • B. Bonev et al., Feature selection, mutual information, and the classification of high-dimensional patterns, Pattern Analysis and Applications (2008)
  • A. Statnikov et al., A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinformatics (2008)
  • C. Strobl et al., Conditional variable importance for random forests, BMC Bioinformatics (2008)
  • J.-G. Zhang et al., Gene selection for classification of microarray data based on the Bayes error, BMC Bioinformatics (2007)
  • X. Zhou et al., MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data, Bioinformatics (2007)
  • Y. Saeys, T. Abeel, Y. van de Peer, Robust feature selection using ensemble feature selection techniques, in:...
  • S.-B. Cho et al., Cancer classification using ensemble of neural networks with multiple significant gene subsets, Applied Intelligence (2007)
  • T.M. Cover et al., Elements of Information Theory (1991)

About the Author: LEI LIU received his B.S. and M.S. degrees in computer science from Jilin University, in 1982 and 1985, respectively. He then joined the College of Computer Science and Technology of Jilin University as a lecturer in the same year. Currently, he is a professor and the director of the Software Formalization Lab at Jilin University. He has wide research interests, mainly including programming theory, semantic web, computational language, pattern recognition and data mining.

About the Author: HUIJIE ZHANG received her B.Sc., M.Sc. and Ph.D. degrees in computer science from Jilin University, in 1998, 2004 and 2008, respectively. Currently, she is a lecturer in the Department of Computer Science, Northeast Normal University, PR China. Her research areas include Geographical Information System (GIS), data mining and pattern recognition.
