Two-stage classification methods for microarray data
Introduction
Gene expression data are a key factor in the success of medical diagnosis (Quackenbush, 2001), and classification methods have therefore been developed for processing microarray data. In a microarray data set, an instance usually contains the expression data of several thousand genes, which serve as the features for identifying the occurrence of a specific disease. Even though the techniques for obtaining microarray data have improved, acquiring such an instance is still expensive. Thus, the number of instances in a microarray data set is generally far smaller than the number of genes in an instance. Since most of the genes are irrelevant to the disease of interest, using all features in classification can slow down computation and contaminate the results with noise. Most traditional classification tools are poorly suited to data sets with few instances and a huge number of features. Many studies have therefore developed new classification methods that filter for the genes critical to the occurrence of a disease.
A two-stage classification method first selects a pre-specified number of genes that is much smaller than the number of genes in an instance. The selected genes are then passed to the second stage for classification. A wide variety of techniques exists for feature selection and classification, and most of these techniques have been studied thoroughly. Any gene selection mechanism and any classification tool can be combined to form a two-stage classification method for microarray data, which demonstrates the flexibility and applicability of the approach. In particular, biologists can analyze microarray data with the tools they already have on hand, without studying new classification tools developed specifically for microarray data. This study attempts to provide some insights and guidelines for designing such a two-stage classification method.
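The two stages described above can be sketched in a few lines. This is only an illustration under stated assumptions: the absolute Welch t-statistic as the gene ranking criterion and a k-nearest-neighbor vote as the classifier are choices made here for concreteness, and the function names and toy data are hypothetical rather than taken from the study.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch two-sample t-statistic for one gene's expression in two classes."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def select_genes(X, y, k):
    """Stage 1 (assumed criterion): indices of the k genes with the largest |t|."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(t_statistic(a, b)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def knn_predict(X_train, y_train, x, n_neighbors=3):
    """Stage 2 (assumed classifier): majority vote among the nearest instances."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = [y_train[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)

def two_stage_predict(X_train, y_train, x_new, k, n_neighbors=3):
    """Select k genes on the training data, then classify x_new on those genes only."""
    genes = select_genes(X_train, y_train, k)
    reduced = [[row[j] for j in genes] for row in X_train]
    return knn_predict(reduced, y_train, [x_new[j] for j in genes], n_neighbors)

# Hypothetical toy data: 6 instances, 4 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.2, 0.3, 9.0],
    [1.2, 4.8, 0.9, 2.0],
    [0.9, 5.1, 0.1, 7.5],
    [3.1, 5.0, 0.8, 8.0],
    [2.9, 4.9, 0.2, 3.0],
    [3.0, 5.3, 0.7, 6.0],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_genes(X, y, 1)
prediction = two_stage_predict(X, y, [2.8, 5.0, 0.5, 9.0], k=1)
```

Because the second stage only ever sees the columns chosen by the first, either function could be swapped for a different ranking criterion or classifier without touching the other.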
As with feature selection tools in general, there are two alternatives in designing the gene selection mechanism for the first stage: individual gene ranking or subset gene ranking (Lu & Han, 2003). Classification tools can be categorized in many ways. This study requires classification tools that are naturally designed for processing continuous features such as gene expression data. Since the expression data of thousands of genes are likely to contain noise, a dimension reduction technique is an appropriate tool for filtering that noise. We therefore investigate whether a classification tool should be preceded by a dimension reduction technique when analyzing microarray data.
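The difference between the two gene selection alternatives can be illustrated with a small sketch. The scoring rules below are assumptions made for the example, not the criteria used in the study: individual ranking scores each gene by the absolute difference of its class means, and subset ranking scores a candidate gene subset by leave-one-out 1-nearest-neighbor accuracy. The toy data are hypothetical and constructed so that genes 0 and 1 are informative only jointly.

```python
import math
from itertools import combinations
from statistics import mean

def individual_score(X, y, j):
    """Score gene j on its own: absolute difference of the two class means."""
    a = [row[j] for row, c in zip(X, y) if c == 0]
    b = [row[j] for row, c in zip(X, y) if c == 1]
    return abs(mean(a) - mean(b))

def subset_score(X, y, genes):
    """Score a gene subset jointly: leave-one-out 1-nearest-neighbor accuracy."""
    Z = [[row[j] for j in genes] for row in X]
    hits = 0
    for i, z in enumerate(Z):
        others = [k for k in range(len(Z)) if k != i]
        nearest = min(others, key=lambda k: math.dist(Z[k], z))
        hits += y[nearest] == y[i]
    return hits / len(Z)

# Hypothetical toy data: the class depends on genes 0 and 1 together; gene 2 is noise.
X = [
    [0.0, 0.0, 0.2], [0.1, 0.1, 0.8], [1.0, 1.0, 0.3], [0.9, 0.9, 0.9],  # class 0
    [0.0, 1.0, 0.4], [0.1, 0.9, 0.6], [1.0, 0.0, 0.5], [0.9, 0.1, 0.3],  # class 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

n_genes = len(X[0])
# Individual ranking: pick the single best-scoring gene.
best_gene = max(range(n_genes), key=lambda j: individual_score(X, y, j))
# Subset ranking: pick the best-scoring pair among all pairs of genes.
best_pair = max(combinations(range(n_genes), 2), key=lambda g: subset_score(X, y, g))
```

On this data the individual criterion prefers the noise gene, because genes 0 and 1 have identical class means, while the subset criterion recovers the pair (0, 1). Exhaustive enumeration of subsets is only feasible for tiny examples; with thousands of genes a search heuristic would be needed.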
This paper is organized as follows. The literature relevant to classifying microarray data is reviewed in Section 2. Section 3 presents the possible designs of two-stage classification methods and introduces the eight two-stage classification methods investigated in this study. Section 4 proposes a procedure for evaluating the performance of these eight methods. This procedure is used in Section 5 to analyze the experimental results obtained from eight microarray data sets for cancer detection. The conclusions and directions for future study are addressed in Section 6.
Related works
Traditional classification methods are generally inapplicable to microarray data, which possess the special characteristics pointed out in Section 1. New classification methods have therefore been developed for processing microarray data in recent years, as summarized in Table 1. Some of the methods listed in Table 1, or parts of them, are brand new and were developed specifically for processing microarray data. However, most of them merely modify well-known techniques and assemble them together to deal with
Structure of two-stage classification methods
A two-stage classification method includes a gene selection mechanism at the first stage and a classification tool that predicts the class of a new instance based on the genes chosen at the first stage. Since the number of genes in a microarray instance is generally more than 1000, and most of the genes cannot provide useful information in classification, a gene selection mechanism is necessary for processing microarray data.
As pointed out in Section 2, the mechanism for gene selection can be
Performance evaluation
A two-stage classification method composed of gene selection mechanism A and classification tool B will be denoted by A/B. For example, the methods proposed by Li et al. (2001) and Nguyen and Rocke (2002) can be represented by GK/K and T/PL, respectively. In this study, we test the two-stage classification methods composed of the four gene selection mechanisms and the two classification tools introduced in the previous section. The testing results will be able to provide some guidelines
Experimental study
In this study, the gene selection mechanism applied at the first stage will be either T, BW, S, or GK, and the tool for classification employed at the second stage will be either K or PL. So, the number of two-stage classification methods investigated in this paper is eight. In this section, we will introduce the characteristics of eight microarray data sets, the procedure of data pre-processing, and the parameter settings for the methods. Then the eight methods will be tested by the eight data
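The count of eight follows from pairing each of the four gene selection mechanisms with each of the two classification tools. As a small sketch, the A/B labels can be enumerated as follows; the variable names are illustrative, and only the abbreviations come from the text.

```python
from itertools import product

selectors = ["T", "BW", "S", "GK"]   # first-stage gene selection mechanisms
classifiers = ["K", "PL"]            # second-stage classification tools

# Every A/B combination of a selector A with a classifier B.
methods = [f"{a}/{b}" for a, b in product(selectors, classifiers)]
print(methods)
# ['T/K', 'T/PL', 'BW/K', 'BW/PL', 'S/K', 'S/PL', 'GK/K', 'GK/PL']
```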
Conclusions
The cause of a disease is believed to depend heavily on gene expression. Many classification methods have therefore been developed for extracting such information from microarray data. Compared with other classification methods for microarray data, two-stage classification methods are more widely applicable and easier to understand. The gene selection mechanisms at the first stage can be either individual gene ranking or subset gene ranking, and the classification tools for the
References (22)
- et al. An Epicurean learning approach to gene-expression data classification. Artificial Intelligence in Medicine (2003)
- et al. Tumor classification using phylogenetic methods on expression data. Journal of Theoretical Biology (2004)
- et al. Cancer classification using gene expression data. Information Systems (2003)
- et al. Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Engineering Applications of Artificial Intelligence (2004)
- et al. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics (2003)
- et al. Reliability analysis of microarray data using fuzzy c-means and normal mixture modeling based classification methods. Bioinformatics (2005)
- et al. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association (2002)
- et al. Using Bayesian networks to analyze expression data. Journal of Computational Biology (2000)
- et al. Analyzing microarray data using quantitative association rules. Bioinformatics (2005)
- et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)