
Pattern Recognition Letters

Volume 26, Issue 10, 15 July 2005, Pages 1444-1453

Feature selection algorithms to find strong genes

https://doi.org/10.1016/j.patrec.2004.11.017

Abstract

The cDNA microarray technology allows us to estimate the expression of thousands of genes in a given tissue. It is natural, then, to use such information to classify different cell states, such as healthy or diseased, or one particular type of cancer or another. However, the number of microarray samples is usually very small, which leads to classification problems with only tens of samples and thousands of features. Recently, Kim et al. proposed using a parameterized distribution based on the original sample set as a way to attenuate this difficulty. Genes that contribute to good classifiers in this setting are called strong. In this paper, we investigate how feature selection techniques can be used to speed up the quest for strong genes. The idea is to use a feature selection algorithm to filter the gene set before applying the original strong-feature technique, which is based on a combinatorial search. The filtering allows us to find very good strong gene sets without resorting to supercomputers. We have tested several filter options and compared the strong genes obtained with those found by the original full combinatorial search.

Introduction

There are many ways to design a classifier from sample data. The worth of such a classifier depends on the suitability of the particular classification rule to the feature-label distribution and on the amount of sample data. Since we are in the context of very small sample sizes, the latter issue is our main concern in this work. The lack of training data makes it necessary to apply simple classification rules to avoid overfitting. Here we are interested in the particular case in which the data are linearly separable, so that a linear classifier (perceptron model) is suitable for classification. Although such an assumption might ordinarily seem too strong, it has proven to hold in many real-world applications arising from gene expression analysis of cancer data (see, for instance, Kim et al., 2002a, Kim et al., 2002b, Morikawa et al., 2003, Luo et al., 2003, Bomprezzi et al., 2003, Simon, 2003).

The error rate of designed classifiers over the population of samples tends to have a large variance in a small-sample setting; hence, the selection of feature sets is problematic. Given a large set of potential features, it is necessary to find a small subset that provides good classification. Small feature sets are prudent because the complexity of the classifier, and therefore its data requirement, typically grows with the number of features. Even with small feature sets, if there are thousands of features, the number of possible feature sets can be astronomical. Consequently, even if the classes are only moderately separated, for small samples there may be thousands of feature sets whose error estimates are close to zero. It would be wrong to conclude that the true errors of all the corresponding classifiers are small. Low estimation of a classifier's population error can result from the peculiarity of the particular sample and/or from the large deviation between the true error of the designed classifier and its estimate computed from the same training data from which it was designed. In our application of direct interest, cancer classification, cross-validation error estimation has been very popular; however, there are serious questions regarding its application to very small samples (Braga-Neto et al., 2004, Braga-Neto and Dougherty, 2004a, Braga-Neto and Dougherty, 2004b).
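The variance problem is easy to reproduce in simulation. The following minimal sketch (all parameters are illustrative assumptions, not an experiment from this paper) repeatedly draws 20-point samples from two moderately separated Gaussian classes and records the leave-one-out cross-validation error of a perceptron designed on each sample; the spread of the estimates across repetitions is the large variance referred to above.

```python
# Minimal sketch: variance of leave-one-out error estimates on small samples.
# All parameters (sample size, class means, spread) are illustrative only.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
estimates = []
for rep in range(200):
    # 10 points per class, d = 2 features, moderately separated classes
    X0 = rng.normal(loc=0.0, scale=1.0, size=(10, 2))
    X1 = rng.normal(loc=1.5, scale=1.0, size=(10, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * 10 + [1] * 10)
    acc = cross_val_score(Perceptron(), X, y, cv=LeaveOneOut())
    estimates.append(1.0 - acc.mean())   # leave-one-out error estimate

print(f"mean LOO error: {np.mean(estimates):.3f}, "
      f"std across samples: {np.std(estimates):.3f}")
```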

To lower both the risk of overfitting and the risk of choosing a feature set on the basis of a deceptively low error estimate, Kim et al. (2002a) have recently proposed, rather than designing a classifier directly from a small sample, to design a perceptron from a distribution that is based on the sample but for which it is more difficult to distinguish the labels. This is done in a parameterized manner in which the parameter relates to the difficulty of classification. The resulting features are called "strong features." In the case of perceptrons, a strictly analytic approach is used to find both the classifier and its error. Analytic design and error estimation facilitate efficient computation in the context of a large set of potential features; nonetheless, the computation quickly becomes infeasible, and the method has been applied using a supercomputer. Here we consider the efficacy of several different feature-selection methods in the context of the strong-feature algorithm.
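Kim et al. (2002a) design the perceptron and compute its error analytically from the spread distribution. Purely to fix ideas, the following Monte Carlo caricature (all names and parameters are illustrative assumptions, not the authors' procedure) replaces each sample point by a Gaussian whose standard deviation sigma plays the role of the difficulty parameter: as sigma grows, the labels become harder to distinguish and the designed perceptron's error on the spread distribution rises.

```python
# Schematic Monte Carlo caricature of the strong-feature idea.
# Kim et al. (2002a) compute the perceptron and its error analytically;
# here we merely sample from a spread distribution to illustrate the role
# of the spread parameter sigma. Everything below is illustrative.
import numpy as np
from sklearn.linear_model import Perceptron

def spread_sample(X, sigma, n_draws, rng):
    """Draw synthetic points from Gaussians centered at the sample points."""
    idx = rng.integers(0, len(X), size=n_draws)
    return X[idx] + rng.normal(scale=sigma, size=(n_draws, X.shape[1]))

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(10, 2))   # class 0 sample
X1 = rng.normal(2.0, 1.0, size=(10, 2))   # class 1 sample

for sigma in (0.1, 0.5, 1.0):             # difficulty parameter
    S0 = spread_sample(X0, sigma, 500, rng)
    S1 = spread_sample(X1, sigma, 500, rng)
    clf = Perceptron().fit(np.vstack([S0, S1]), np.repeat([0, 1], 500))
    T0 = spread_sample(X0, sigma, 5000, rng)
    T1 = spread_sample(X1, sigma, 5000, rng)
    err = 1.0 - clf.score(np.vstack([T0, T1]), np.repeat([0, 1], 5000))
    print(f"sigma={sigma}: error on spread distribution = {err:.3f}")
```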

The immediate application of interest is classification via cDNA microarrays, which provide expression measurements for thousands of genes simultaneously (DeRisi et al., 1997, Duggan et al., 1999, Schena et al., 1995). A key goal for the use of expression data is to perform classification via different expression patterns. A successful classifier provides a list of genes whose product abundance is indicative of important differences in cell state, such as healthy or diseased, or one particular type of cancer or another. Two central goals of molecular analysis of disease are to use such information to diagnose the presence or type of disease directly and to produce therapies based on the disruption or correction of the aberrant function of gene products whose activities are central to the pathology of the disease. Correction would be accomplished either by the use of drugs already known to act on these gene products or by the development of new drugs targeting them. Achieving these goals requires designing a classifier that takes a vector of gene expression levels as input and outputs a class label predicting the class containing the input vector. Classification can be between different kinds of cancer, different stages of tumor development, or many other such distinctions.

The inherent class-separating power of expression data has been clearly demonstrated (Ben-Dor et al., 2001, Golub et al., 1999, Hedenfalk et al., 2001, Khan et al., 2001, Kobayashi et al., 2003). Going further, sufficient information must be vested in sets of genes small enough to serve as either convenient diagnostic panels or as candidates for the very expensive and time-consuming analysis required to determine if they could serve as useful targets for therapy. The problem at this stage is that there is a very large set of gene-expression profiles (features) and typically a small number of microarrays (sample points), making it difficult to find the best features from which to construct a classifier. We require methods to find gene sets that can perform accurate classification in distributional settings whose dispersions are in excess of the sample data. In this direction, the strong-feature methodology has been used successfully to find feature sets in several oncogenomic settings: breast cancer (Kim et al., 2002a), glioma (Kim et al., 2002b), lymphoma (Kobayashi et al., 2003), and leukemia (Morikawa et al., 2003). The purpose of this paper is to examine the performance of a number of feature-selection methods for the strong-feature methodology.

Section snippets

Finding strong features: original algorithm

In this section, we review the strong-feature algorithm. We denote random variables by capital italic letters A, B, …, Z. A random vector is denoted by a capital boldface italic letter; for example, X = (X1, X2, …, Xd). A binary classification involving the random feature vector X is determined by a binary random variable Y taking the values (class labels) 0 and 1. A classifier or filter is a function of X that serves as an estimator of Y. For a feature vector x = (x1, x2, …, xd), a perceptron is defined by ψ(x) = T[a0 + a1x1 + ⋯ + adxd], where T is the unit step (threshold) function: T[z] = 1 if z > 0 and T[z] = 0 otherwise.
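As a minimal illustration of this definition (the weights a0, …, ad below are arbitrary inputs; in the strong-feature algorithm they are obtained analytically):

```python
# Minimal sketch of the perceptron definition above: psi(x) = T[a0 + a.x],
# with T the unit step function. The weights here are illustrative inputs;
# in the strong-feature algorithm they are obtained analytically.
import numpy as np

def perceptron(x, a0, a):
    """Return the class label T[a0 + a1*x1 + ... + ad*xd] in {0, 1}."""
    return 1 if a0 + np.dot(a, x) > 0 else 0

# Toy usage: a 2-gene feature vector classified by a fixed hyperplane.
x = np.array([0.8, -0.3])          # expression levels (illustrative)
print(perceptron(x, a0=-0.2, a=np.array([1.0, 0.5])))   # -> 1
```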

Short description of the feature selection methods

In order to save computational effort in the exhaustive search for strong feature sets, we propose to use a feature selection algorithm to reduce the number of genes to a manageable amount. After this pre-selection phase, a full search is carried out on the pre-selected genes in order to recover the best strong gene sets. Following this strategy, we hope to find many of the best sets while considerably decreasing the required computational resources.
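To make the two-phase strategy concrete, consider the savings in subset counts: an exhaustive search over 3-gene sets examines C(1000, 3) ≈ 1.66 × 10^8 subsets on a 1000-gene list, but only C(100, 3) = 161,700 after pre-selection to 100 genes. The sketch below illustrates the pipeline with placeholder choices (a t-statistic filter and a class-mean separation score standing in for the analytic strong-feature error); the five filters actually compared in this paper are described next.

```python
# Sketch of the two-phase strategy: filter genes first, then run a full
# search over all small subsets of the survivors. The t-statistic filter,
# the cutoff of 100 genes, and the separation score are placeholder
# choices, not the specific methods compared in this paper.
from itertools import combinations
import numpy as np

def t_statistic(expr, labels):
    """Per-gene absolute two-sample t-like statistic (genes are columns)."""
    g0, g1 = expr[labels == 0], expr[labels == 1]
    se = np.sqrt(g0.var(axis=0) / len(g0) + g1.var(axis=0) / len(g1))
    return np.abs(g0.mean(axis=0) - g1.mean(axis=0)) / (se + 1e-12)

def separation_score(sub, labels):
    """Toy subset criterion: distance between class means. In the actual
    method this would be the analytic strong-feature error of Kim et al."""
    return np.linalg.norm(sub[labels == 1].mean(axis=0)
                          - sub[labels == 0].mean(axis=0))

def find_candidate_sets(expr, labels, keep=100, subset_size=3):
    # Phase 1: keep the `keep` genes ranked highest by the filter.
    top = np.argsort(t_statistic(expr, labels))[::-1][:keep]
    # Phase 2: full combinatorial search over subsets of the survivors.
    scored = [(separation_score(expr[:, list(genes)], labels), genes)
              for genes in combinations(top, subset_size)]
    return sorted(scored, reverse=True)[:10]   # ten best-scoring gene sets
```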

We present below the five feature selection methods considered for the pre-selection phase.

Computational experiments

We have performed a number of computational experiments based on the different proposed strategies to find strong gene sets. In all experiments, the full gene list has been reduced to approximately 100 genes using the methods described in Section 3, after which a full search is employed to find the best feature sets among the pre-selected genes. The results split naturally into two groups: the guided random walk and PCA do not succeed in finding good gene sets; the other methods, on the other hand, recover many of the best strong gene sets found by the original full combinatorial search.

References (22)

  • U. Braga-Neto et al.

    Bolstered error estimation

    Pattern Recognition

    (2004)
  • P. Pudil et al.

    Floating search methods in feature selection

    Pattern Recognition Letters

    (1994)
  • P. Somol et al.

    Adaptive floating search methods in feature selection

    Pattern Recognition Letters

    (1999)
  • A. Ben-Dor et al.

    Tissue classification with gene expression profiles

    Journal of Computational Biology

    (2000)
  • R. Bomprezzi et al.

    Gene expression in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease

    Human Molecular Genetics

    (2003)
  • P. Bradley et al.

    Feature selection via mathematical programming

    INFORMS Journal on Computing

    (1998)
  • U. Braga-Neto et al.

    Is cross-validation valid for small-sample microarray classification?

    Bioinformatics

    (2004)
  • U. Braga-Neto et al.

    Is cross-validation better than resubstitution for ranking genes?

    Bioinformatics

    (2004)
  • J. DeRisi et al.

    Exploring the metabolic and genetic control of gene expression on a genomic scale

    Science

    (1997)
  • D. Duggan et al.

    Expression profiling using cDNA microarrays

    Nature Genetics

    (1999)
  • T. Golub et al.

    Molecular classification of cancer: class discovery and class prediction by gene expression monitoring

    Science

    (1999)