Two-stage classification methods for microarray data

https://doi.org/10.1016/j.eswa.2006.09.005

Abstract

Gene expression data are a key factor for the success of medical diagnosis, and two-stage classification methods have therefore been developed for processing microarray data. The first stage of such a method selects a pre-specified number of genes that are likely to be the most relevant to the occurrence of a disease and passes these genes to the second stage for classification. In this paper, we use four gene selection mechanisms and two classification tools to compose eight two-stage classification methods, and test these eight methods on eight microarray data sets to analyze their performance. The first interesting finding is that the gene sets chosen by different categories of gene selection mechanisms share fewer than half of their genes, yet yield classification accuracies that do not differ significantly. A subset-gene-ranking mechanism can improve classification accuracy, but its computational effort is much heavier. Whether the classification tool employed at the second stage should be accompanied by a dimension reduction technique depends on the characteristics of the data set.

Introduction

Gene expression data are a key factor for the success of medical diagnosis (Quackenbush, 2001), and classification methods have therefore been developed for processing microarray data. In a microarray data set, an instance usually contains the expression data of several thousand genes, which are the features for identifying the occurrence of a specific disease. Even though the techniques for obtaining microarray data have improved, acquiring such an instance is still expensive. Thus, the number of instances in a microarray data set is generally far smaller than the number of genes in an instance. Since most of the genes are irrelevant to the disease of interest, using all features in classification can slow down the computation and contaminate the results with noise. Most traditional classification tools are infeasible for data sets with few instances and a huge number of features. Many studies have therefore developed new classification methods that filter for the genes critical to the occurrence of a disease.

A two-stage classification method first selects a pre-specified number of genes that is much smaller than the number of genes in an instance. The selected genes are then passed to the second stage for classification. A wide variety of techniques exist for feature selection and classification, and most of them have been studied thoroughly. Any gene selection mechanism can be combined with any classification tool to form a two-stage classification method for microarray data, which demonstrates the flexibility and applicability of the approach. In particular, biologists can use the tools they already have on hand to analyze microarray data without learning new classification tools developed specifically for microarray data. This study attempts to provide some insights and guidelines for designing such a two-stage classification method.
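The two stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Welch-style t score, the k-nearest-neighbour vote, the function names, and the toy data are all assumptions chosen only to make the select-then-classify structure concrete.

```python
from collections import Counter
from math import dist, sqrt
from statistics import mean, stdev

def t_like_score(values_a, values_b):
    """Absolute Welch-style t statistic for one gene (assumed scoring rule)."""
    var_term = (stdev(values_a) ** 2 / len(values_a)
                + stdev(values_b) ** 2 / len(values_b))
    diff = abs(mean(values_a) - mean(values_b))
    if var_term == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / sqrt(var_term)

def select_genes(X, y, n_genes):
    """Stage 1: score each gene on its own and keep the top n_genes indices."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == classes[0]]
        b = [row[g] for row, label in zip(X, y) if label == classes[1]]
        scores.append((t_like_score(a, b), g))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, index breaks ties
    return [g for _, g in scores[:n_genes]]

def knn_predict(X_train, y_train, x_new, genes, k=3):
    """Stage 2: k-nearest-neighbour vote using only the selected genes."""
    def project(row):
        return [row[g] for g in genes]
    target = project(x_new)
    neighbours = sorted(zip(X_train, y_train),
                        key=lambda pair: dist(project(pair[0]), target))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): 6 instances, 4 genes; only gene 1 separates the classes.
X = [
    [5.0, 1.0, 3.2, 7.1],
    [5.2, 1.2, 2.9, 6.8],
    [4.9, 0.9, 3.1, 7.3],
    [5.1, 3.0, 3.0, 7.0],
    [5.0, 3.2, 3.3, 6.9],
    [4.8, 2.9, 2.8, 7.2],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

selected = select_genes(X, y, n_genes=2)
prediction = knn_predict(X, y, [5.0, 3.1, 3.0, 7.0], selected)
print(selected, prediction)
```

Because the two stages communicate only through the list of selected gene indices, either function could be swapped for another selector or classifier without touching the other, which is the flexibility the paragraph above points to.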

As with feature selection tools in general, there are two alternatives in designing the gene selection mechanism for the first stage: individual gene ranking or subset gene ranking (Lu & Han, 2003). The tools for classification can be categorized in many ways. In this study, we need classification tools that are naturally designed for processing continuous features such as gene expression data. Since the expression data of thousands of genes are likely to contain noise, a dimension reduction technique is an appropriate tool for filtering out such noise. We therefore investigate whether a classification tool should be preceded by a dimension reduction technique when analyzing microarray data.
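The difference between the two gene selection alternatives can be illustrated with a small sketch. The scoring rule here (leave-one-out accuracy of a 1-nearest-neighbour rule), the exhaustive search over subsets, the function names, and the toy data are assumptions made for illustration; they are not the mechanisms evaluated in the paper.

```python
from itertools import combinations
from math import dist

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule restricted to `genes`."""
    points = [[row[g] for g in genes] for row in X]
    correct = 0
    for i, p in enumerate(points):
        others = [(dist(p, q), y[j]) for j, q in enumerate(points) if j != i]
        nearest_label = min(others, key=lambda t: t[0])[1]
        correct += nearest_label == y[i]
    return correct / len(X)

def rank_individually(X, y, n_genes):
    """Individual gene ranking: score every gene alone, keep the top n_genes."""
    scores = [(loo_accuracy(X, y, [g]), g) for g in range(len(X[0]))]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:n_genes]]

def best_subset(X, y, n_genes):
    """Subset gene ranking: score every subset of size n_genes jointly."""
    candidates = combinations(range(len(X[0])), n_genes)
    return list(max(candidates, key=lambda genes: loo_accuracy(X, y, genes)))

# Toy data (hypothetical): genes 0 and 1 are informative only in combination.
X = [
    [0.0, 0.0, 0.5], [0.1, 0.1, 0.4], [1.0, 1.0, 0.6], [0.9, 0.9, 0.5],
    [0.0, 1.0, 0.5], [0.1, 0.9, 0.6], [1.0, 0.0, 0.4], [0.9, 0.1, 0.5],
]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(rank_individually(X, y, 2))  # genes judged one at a time
print(best_subset(X, y, 2))        # genes judged as a group
```

On this toy data the jointly informative pair is found only by the subset search, but that search evaluates every combination of genes. With thousands of genes the number of combinations grows too quickly for exhaustive evaluation, so a practical subset-ranking mechanism would need a heuristic search; this is one way to see why subset ranking carries a much heavier computational cost.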

This paper is organized as follows. The literature relevant to classifying microarray data is reviewed in Section 2. Section 3 presents the possible designs of two-stage classification methods and introduces the eight two-stage classification methods investigated in this study. Section 4 proposes a procedure for evaluating the performance of these eight methods. This procedure is used in Section 5 to analyze the experimental results obtained from eight microarray data sets for cancer detection. Conclusions and directions for future study are given in Section 6.

Section snippets

Related works

Traditional classification methods are generally inapplicable to microarray data, which possess the special characteristics pointed out in Section 1. New classification methods have therefore been developed for processing microarray data in recent years, as summarized in Table 1. Some of the methods listed in Table 1 are brand new and were developed specifically for processing microarray data. However, most of them merely modify well-known techniques and assemble them to deal with

Structure of two-stage classification methods

A two-stage classification method includes a gene selection mechanism at the first stage and a classification tool that predicts the class of a new instance based on the genes chosen at the first stage. Since the number of genes in a microarray instance is generally more than 1000, and most of the genes cannot provide useful information in classification, a gene selection mechanism is necessary for processing microarray data.

As pointed out in Section 2, the mechanism for gene selection can be

Performance evaluation

A two-stage classification method composed of gene selection mechanism A and classification tool B will be denoted by A/B. For example, the methods proposed by Li et al. (2001) and Nguyen and Rocke (2002) can be represented by GK/K and T/PL, respectively. In this study, we test the two-stage classification methods composed of the four gene selection mechanisms and the two classification tools introduced in the previous section. The testing results will be able to provide some guidelines

Experimental study

In this study, the gene selection mechanism applied at the first stage is T, BW, S, or GK, and the classification tool employed at the second stage is either K or PL. The number of two-stage classification methods investigated in this paper is therefore eight. In this section, we introduce the characteristics of the eight microarray data sets, the data pre-processing procedure, and the parameter settings for the methods. Then the eight methods will be tested by the eight data
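The count of eight follows from pairing each of the four first-stage mechanisms with each of the two second-stage tools. A short sketch of that enumeration, using the A/B notation and treating the labels purely as opaque names (the variable names are illustrative):

```python
from itertools import product

gene_selectors = ["T", "BW", "S", "GK"]  # first-stage labels as given in the text
classifiers = ["K", "PL"]                # second-stage labels as given in the text

# Every selector paired with every classifier, written in A/B notation.
methods = [f"{a}/{b}" for a, b in product(gene_selectors, classifiers)]
print(len(methods), methods)
```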

Conclusions

The cause of a disease is believed to depend heavily on gene expression. Many classification methods have therefore been developed for extracting such information from microarray data. Compared with other classification methods for microarray data, two-stage classification methods are more widely applicable and easier to understand. The gene selection mechanisms at the first stage can be either individual gene ranking or subset gene ranking, and the classification tools for the
