Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models

https://doi.org/10.1016/j.compbiolchem.2004.05.002

Abstract

High-throughput DNA microarrays provide an effective means of simultaneously monitoring the expression levels of thousands of genes in a sample. One promising application of this technology is the molecular diagnosis of cancer, e.g. distinguishing normal tissue from tumor tissue or classifying tumors into different types or subtypes. A problem arising from the use of microarray data is how to analyze high-dimensional gene expression data, which typically comprise thousands of variables (genes) but far fewer observations (samples). There is a need to develop reliable classification methods that make full use of microarray data and to evaluate accurately the predictive ability and reliability of the derived models. In this paper, discriminant partial least squares was used to classify different types of human tumors in four microarray datasets and showed good prediction performance. Four different cross-validation procedures (leave-one-out versus leave-half-out; incomplete versus full) were used to evaluate the classification model. Our results indicate that discriminant partial least squares with leave-half-out cross-validation provides a more realistic estimate of the predictive ability of a classification model, which some cross-validation procedures may overestimate, and that the differences among cross-validation procedures can be used to assess the reliability of the classification model.

Introduction

Improvements in cancer classification have been of great importance in cancer treatment. Traditional cancer classification, based primarily on the morphological appearance of tumors, relies on specific biological insight and experience rather than a systematic approach to tumor recognition (Golub et al., 1999). Such approaches struggle to distinguish tumors with similar histopathological appearance but different clinical courses and responses to therapy. The advent of DNA microarray technology makes it possible to classify cancers on a genome-wide scale by simultaneously monitoring the expression of thousands of genes in a sample (Golub et al., 1999, Young, 2000, Shi, 2001, Gershon, 2002). Various clustering, classification, and prediction techniques have been used to analyze and understand the gene expression data resulting from DNA microarrays. Some recent applications include: molecular classification of acute leukemia (Golub et al., 1999), classification of human cancer cell lines (Ross et al., 2000), support vector machine classification of cancer tissue samples (Furey et al., 2000), classification of cancers based on gene expression signatures using artificial neural networks (Khan et al., 2001), mapping of the physiological state of cells and tissues and identification of important genes using Fisher discriminant analysis (Stephanopoulos et al., 2002), tumor classification by logistic discrimination and quadratic discriminant analysis after dimension reduction by partial least squares (Nguyen and Rocke, 2002a, Nguyen and Rocke, 2002b), and principal component analysis disjoint models for cancer classification (Bicciato et al., 2003).

DNA microarray gene expression data are usually characterized by many thousands of variables (genes) and only a few observations (samples), which often implies a high degree of multicollinearity in the gene expression data. One common way to handle this kind of severely ill-conditioned problem is to reduce the dimensionality of the gene expression data. Principal component analysis (PCA) may be the most popular approach for this purpose; it attempts to find a set of orthogonal principal components (linear combinations of the original variables) that account for the maximum variation in the gene expression data. Since the information about classification provided by the response variable (class membership) is not taken into account when constructing the principal components, the performance of PCA in classification may not be optimal. To overcome this problem, partial least squares (PLS), developed by Wold et al. (1984), was used to find orthogonal linear combinations of the original predictor variables that correlate highly with the response variables while accounting for as much variance in the predictors as possible. PLS can be seen as a compromise between PCA and ordinary least squares regression (Park et al., 2002), and it is of particular interest due to its powerful capability to analyze data with strongly collinear (correlated), noisy, and numerous X-variables (Wold et al., 2001). Since its first introduction into chemometrics as a multivariate calibration tool, PLS has been widely applied in many other fields, such as quantitative structure–activity relationships (Cramer et al., 1988) and pattern classification (Sjostrom et al., 1986, Stahle and Wold, 1987, Gottfries et al., 1995, Song and Hopke, 1999).
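The core PLS step described above can be sketched in a few lines of numpy. This is a minimal, generic PLS2 component extraction (weights from the dominant singular direction of the cross-covariance, followed by deflation), not a reproduction of the authors' implementation; the toy data are hypothetical.

```python
import numpy as np

def pls_components(X, Y, n_comp):
    """Extract PLS score vectors for column-centered X and Y (sketch).

    Each weight vector is the dominant direction of the cross-covariance
    X'Y, so each score t = Xw has maximal covariance with the responses;
    X and Y are then deflated and the step repeats.
    """
    X, Y = X.copy(), Y.copy()
    scores = []
    for _ in range(n_comp):
        C = X.T @ Y                                    # cross-covariance
        w = np.linalg.svd(C, full_matrices=False)[0][:, 0]
        t = X @ w                                      # score vector
        p = X.T @ t / (t @ t)                          # X-loading
        q = Y.T @ t / (t @ t)                          # Y-loading
        X = X - np.outer(t, p)                         # deflate X
        Y = Y - np.outer(t, q)                         # deflate Y
        scores.append(t)
    return np.column_stack(scores)

# Hypothetical toy data: 6 samples, 3 "genes", 2 dummy-coded classes
X = np.array([[1., 2., 0.], [2., 1., 1.], [0., 1., 2.],
              [1., 0., 1.], [2., 2., 2.], [0., 0., 0.]])
Y = np.array([[1., 0.], [1., 0.], [0., 1.],
              [0., 1.], [1., 0.], [0., 1.]])
T = pls_components(X - X.mean(axis=0), Y - Y.mean(axis=0), 2)
```

Because X is deflated after each component, successive score vectors are mutually orthogonal, which is what makes the low-dimensional representation well-conditioned for subsequent regression.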

When PLS is used as a discrimination procedure in classification, it is called discriminant PLS (D-PLS) (Song and Hopke, 1999, Barker and Rayens, 2003). It is well known that PLS is related to canonical correlation analysis (CCA). Considering that CCA is, in turn, related to linear discriminant analysis (LDA), Barker and Rayens (2003) have recently shown a direct connection between PLS and LDA. They also pointed out that PCA is only capable of identifying gross variability, rather than distinguishing ‘among-groups’ from ‘within-groups’ variability, as is the explicit goal of the LDA paradigm and D-PLS. Therefore, D-PLS will perform better than, or at least no worse than, PCA for dimension reduction aimed at classification, especially when ‘within-groups’ variability dominates ‘among-groups’ variability in the data (Barker and Rayens, 2003).
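In practice, D-PLS turns a multi-class problem into a regression problem by dummy-coding class membership as a response matrix with one column per class; a new sample is then assigned to the class whose column has the largest predicted value. A minimal sketch of this encoding and decision rule, with hypothetical labels and predicted values:

```python
import numpy as np

# Hypothetical class labels for 5 samples (three tumor types)
labels = ["AML", "B-ALL", "T-ALL", "AML", "B-ALL"]
classes = sorted(set(labels))  # ['AML', 'B-ALL', 'T-ALL']

# Dummy (one-hot) response matrix Y used by D-PLS: one column per class,
# 1 in the column of the sample's class and 0 elsewhere
Y = np.array([[1.0 if c == lab else 0.0 for c in classes] for lab in labels])

# A fitted D-PLS model returns a continuous row per test sample; the
# predicted class is the column with the largest value (values below
# are hypothetical, standing in for actual model output)
Y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7]])
predicted = [classes[i] for i in np.argmax(Y_pred, axis=1)]
```

The argmax rule generalizes naturally to any number of tumor types, which is why D-PLS handles multi-class problems directly rather than via pairwise comparisons.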

Another problem associated with gene expression data is that most of the genes monitored on a microarray may not be relevant to classification, and these genes may degrade prediction performance by masking the contribution of the relevant genes (Stephanopoulos et al., 2002; Nguyen and Rocke, 2002a, Nguyen and Rocke, 2002b; Bicciato et al., 2003). Thus, the elimination of genes unrelated to classification is of great importance. A question arising from gene selection combined with leave-one-out cross-validation (LOOCV) in microarray data analysis is whether the gene selection step should be placed inside the CV loop or not (Nguyen and Rocke, 2002b, Simon et al., 2003, Ntzani and Ioannidis, 2003, Ambroise and McLachlan, 2002). As has been noted recently (Nguyen and Rocke, 2002b, Simon et al., 2003, Ntzani and Ioannidis, 2003), in most of the literature on tumor classification using microarrays, the gene selection step was performed before (outside of) the CV loop. This kind of incomplete CV is not cross-validation in the strict sense and usually gives spuriously good statistical estimates for a classification model (Simon et al., 2003, Ntzani and Ioannidis, 2003, Ambroise and McLachlan, 2002). Using a simulated dataset, Simon et al. (2003) clearly demonstrated the need for full LOOCV to assess the predictive ability of a classification model.
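The distinction between incomplete and full CV comes down to where the selection step sits. The sketch below shows full LOOCV with gene selection repeated inside every fold; the top-variance criterion and nearest-centroid classifier are stand-ins for illustration, not the selection criterion or classifier used in the paper, and the toy data are hypothetical.

```python
import numpy as np

def nearest_centroid(X_train, y_train, x_test):
    """Stand-in classifier: assign the class with the closest mean profile."""
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    return min(centroids, key=lambda c: np.linalg.norm(x_test - centroids[c]))

def full_loocv_accuracy(X, y, n_genes):
    """Full LOOCV: gene selection is redone inside every fold (sketch)."""
    n = X.shape[0]
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        # Selection sees ONLY the training fold; selecting once on all the
        # data (incomplete CV) would let the held-out sample influence
        # which genes are kept, biasing the error estimate downward
        keep = np.argsort(X[train].var(axis=0))[-n_genes:]
        pred = nearest_centroid(X[train][:, keep], y[train], X[i, keep])
        correct += int(pred == y[i])
    return correct / n

# Hypothetical toy data: genes 0 and 1 separate the classes, gene 2 is constant
X = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.1, 1.0],
              [5.0, 5.0, 1.0], [5.1, 5.0, 1.0], [5.0, 5.1, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
accuracy = full_loocv_accuracy(X, y, n_genes=2)
```

Moving the `keep = ...` line out of the loop and computing it once on the full X is exactly the incomplete-CV mistake the cited papers warn about.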

In this paper, a leave-half-out CV (LHOCV) procedure, in addition to LOOCV, was applied to four real microarray datasets to assess the predictive ability of a classification model more realistically, as well as to show more clearly the difference between incomplete and full CV. This difference also provides a way to measure the reliability of a classification model.
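Generating the repeated random half-splits that LHOCV relies on is straightforward; the sketch below is a generic illustration (the number of repeats and the random-split scheme are assumptions, not the paper's exact protocol).

```python
import random

def leave_half_out_splits(n_samples, n_repeats, seed=0):
    """Repeated leave-half-out train/test index splits (sketch).

    Each repeat holds out a random half of the samples as a test set;
    for the CV to be 'full', the model (including any gene selection)
    must be rebuilt from scratch on the remaining half each time.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        rng.shuffle(indices)
        half = n_samples // 2
        yield sorted(indices[:half]), sorted(indices[half:])  # (train, test)

splits = list(leave_half_out_splits(10, 3))
```

Because each model sees only half the data, LHOCV is a harder test than LOOCV, which is why it tends to give the more conservative (and more realistic) estimate of predictive ability discussed above.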

Section snippets

Discriminant partial least squares (D-PLS)

The basic idea of projection or dimension-reduction approaches such as PCA and PLS is to project the observations (samples) from the high-dimensional variable (gene) space onto a low-dimensional subspace spanned by several linear combinations of the original variables, chosen to satisfy a certain objective criterion. PCA attempts to find a set of orthogonal principal components that account for the maximum variance in the predictors (X). Since there is no guarantee that the principal components

Acute leukemia data

The leukemia dataset was measured by Golub et al. (1999) using Affymetrix high-density oligonucleotide microarrays containing probes for 6817 human genes. The original training dataset consisted of 38 bone marrow samples from acute leukemia patients, including 19 B-cell acute lymphoblastic leukemia (B-ALL), 8 T-cell acute lymphoblastic leukemia (T-ALL), and 11 acute myeloid leukemia (AML). The original independent (test) dataset consisted of 24 bone marrow and 10 peripheral blood samples (19

Conclusions

In this paper, D-PLS was used directly to perform multi-class classification of tumor samples based on microarray gene expression data. Four microarray datasets were used to demonstrate the effectiveness and reliability of D-PLS in the classification of tumors. Although the best result of our D-PLS model is superior to the best result of Nguyen and Rocke’s (2002b) approach, where PLS was used only as a dimension-reduction tool and the classification was done by LDA or QDA, in three datasets

Acknowledgements

We thank referees for helpful comments. Y.X. Tan and C. Wang were partially supported by General Clinical Research Center grant M01-RR00425 from the National Center for Research Resources.

References (31)

  • D. Gershon, Microarray technology: an array of opportunities, Nature (2002)
  • T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • J. Gottfries et al., Diagnosis of dementias using partial least squares discrimination analysis, Dementia (1995)
  • I. Hedenfalk et al., Gene-expression profiles in hereditary breast cancer, N. Engl. J. Med. (2001)
  • A. Höskuldsson, PLS regression methods, J. Chemom. (1988)