Use of SVD-based probit transformation in clustering gene expression profiles

https://doi.org/10.1016/j.csda.2007.01.022

Abstract

The mixture-Gaussian model-based clustering method has received much attention in the bioinformatics literature on clustering gene expression profiles. However, the method suffers from two difficulties in applications. The first concerns parameter estimation, which becomes difficult when the dimension of the data is high or the size of a cluster is small. The second concerns the normality assumption for gene expression levels, which is seldom satisfied by real data. In this paper, we propose to overcome these two difficulties by the probit transformation in conjunction with the singular value decomposition (SVD). SVD reduces the dimensionality of the data, and the probit transformation converts the scaled eigensamples, which can be interpreted as correlation coefficients as explained in the text, into Gaussian random variables. Our numerical results show that the SVD-based probit transformation enhances the ability of the mixture-Gaussian model-based clustering method to identify prominent patterns in the data. As a by-product, we show that the SVD-based probit transformation also improves the performance of model-free clustering methods, such as hierarchical clustering, K-means and self-organizing maps (SOM), on data sets containing scattered genes. We also propose a run test-based rule for selecting the eigensamples used for clustering.

Introduction

In attempts to understand biological systems, large amounts of gene expression data have been generated by researchers. Due to the large number of genes and the complexity of biological systems, clustering has become one of the most important exploratory tools for analyzing these data. Clustering identifies groups of genes that exhibit similar expression profiles. The popular clustering methods can be divided into two categories, namely, model-free methods and model-based methods. In the model-free clustering methods, no probabilistic model is specified for the data, and clustering is performed either by optimizing a certain target function or by iteratively agglomerating/dividing genes to form a bottom-up/top-down tree. Examples include K-means (Tavazoie et al., 1999; Tou and Gonzalez, 1979), hierarchical clustering (Carr et al., 1997; Eisen et al., 1998), and self-organizing maps (Tamayo et al., 1999), among others. The model-based clustering methods construct clusters under the assumption that the data follow a mixture distribution. A non-exhaustive list of recent works in this direction includes Banfield and Raftery (1993), Biernacki et al. (1999), Fraley and Raftery (2002), Yeung et al. (2001), Medvedovic and Sivaganesan (2002), McLachlan et al. (2002), Wakefield et al. (2003) and Medvedovic et al. (2004). One advantage of the model-based clustering methods is that the probability model provides criteria for choosing an appropriate number of clusters.

Among the model-based clustering methods, the one based on the mixture-Gaussian distribution is of particular interest due to its computational simplicity. Henceforth, the mixture-Gaussian model-based clustering method will be abbreviated as the MG method. The MG method assumes that the observations $x_1,\ldots,x_n$ are generated from a mixture-Gaussian distribution with an unknown number of components. The corresponding likelihood function is
$$L(x_1,\ldots,x_n \mid \omega_k,\mu_k,\Sigma_k,\ k=1,\ldots,G)=\prod_{i=1}^{n}\sum_{k=1}^{G}\frac{\omega_k}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left\{-\tfrac{1}{2}(x_i-\mu_k)^{T}\Sigma_k^{-1}(x_i-\mu_k)\right\},$$
where $G$ is the (unknown) number of components, $p$ is the dimension of the observations, $\omega_k$ is the probability that an observation belongs to component $k$ ($\omega_k\ge 0$ and $\sum_{k=1}^{G}\omega_k=1$), and $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of component $k$, respectively. Banfield and Raftery (1993) proposed to reparametrize the covariance matrices by the eigenvalue decomposition
$$\Sigma_k=\lambda_k D_k A_k D_k^{T},$$
where $\lambda_k=|\Sigma_k|^{1/p}$, $D_k$ is the matrix of eigenvectors of $\Sigma_k$, and $A_k$ is a diagonal matrix with $|A_k|=1$. The parameter $\lambda_k$ determines the volume of component $k$, $D_k$ determines its orientation, and $A_k$ its shape. Allowing some but not all of these quantities to vary between components yields a set of parsimonious models appropriate for various clustering situations. Fraley and Raftery (2002) considered 10 different models corresponding to different assumptions on the component covariance matrices. Each model is denoted by a string of three letters, E (equal), V (variable) and I (identity). The first letter states the assumption on the volumes of the clusters, the second on the shapes, and the third on the orientations. For example, VEI represents a model in which the volumes of the clusters may vary (V), the shapes of all clusters are equal (E), and the orientation matrix is the identity (I). The MG method is implemented in the software MCLUST, which is downloadable at http://www.stat.washington.edu/mclust. In MCLUST, the model parameters are estimated using the EM algorithm (Dempster et al., 1977), and the BIC criterion is adopted for determining the number of clusters and the covariance structure.
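
As a rough illustration of this model-selection step, the sketch below fits mixture-Gaussian models over a grid of component counts and covariance structures and keeps the fit with the lowest BIC. It is not the paper's MCLUST pipeline: scikit-learn's GaussianMixture offers only the "full", "tied", "diag" and "spherical" covariance structures rather than MCLUST's ten parametrizations, and its BIC is defined so that smaller values are preferred.

```python
# Minimal sketch of BIC-based selection of a mixture-Gaussian model.
# Not the MCLUST implementation used in the paper; see the note above.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mg_by_bic(X, max_components=10,
                  covariance_types=("full", "tied", "diag", "spherical")):
    """Return the fitted GaussianMixture with the lowest BIC on X (n x p)."""
    best_model, best_bic = None, np.inf
    for cov in covariance_types:
        for g in range(1, max_components + 1):
            gm = GaussianMixture(n_components=g, covariance_type=cov,
                                 n_init=5, random_state=0).fit(X)
            bic = gm.bic(X)  # scikit-learn's BIC: smaller is better
            if bic < best_bic:
                best_model, best_bic = gm, bic
    return best_model, best_bic

# Example usage on synthetic data:
# X = np.random.randn(300, 4)
# model, _ = fit_mg_by_bic(X)
# labels = model.predict(X)  # hard cluster assignments
```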

Although the MG method has achieved great success in clustering gene expression profiles (Yeung et al., 2001), its applications may be seriously hindered by the following two difficulties. The first concerns parameter estimation, which becomes difficult when the dimension of the data is high or the size of a cluster is small. The second concerns the validity of the normality assumption, which is seldom satisfied by real data. Applying the MG method to a data set whose distribution deviates from Gaussian will result in a sub-optimal clustering.

In this paper, we propose to overcome the above two difficulties by an SVD-based probit transformation. SVD reduces the dimension of the observations, and the probit transformation converts the scaled eigensamples, which can be interpreted as correlation coefficients as explained below, into Gaussian random variables. Our numerical results show that the transformation enhances the ability of the MG method to identify prominent patterns in the data. We also propose a run test-based rule for selecting the eigensamples used for clustering. The new rule works well for all examples studied in this paper. Although the main theme of this paper is to show that the SVD-based probit transformation generally improves the performance of the MG method in clustering gene expression profiles, as a by-product we show that the transformation also improves the performance of the model-free clustering methods for data sets containing scattered genes (Tseng and Wong, 2005).
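
The run test-based rule itself is described later in the paper and is not reproduced here. As a hedged sketch, the function below computes a Wald-Wolfowitz runs test with the usual normal approximation, which is the kind of randomness check such a rule can build on; which sequence is tested and how the resulting p-value enters the selection decision are left open, since those details belong to the paper.

```python
# Hedged sketch of a Wald-Wolfowitz runs test (normal approximation).
# This is NOT the paper's eigensample-selection rule, only the generic test.
import numpy as np
from scipy.stats import norm

def runs_test_pvalue(x):
    """Two-sided p-value of a runs test on the signs of x about its median."""
    x = np.asarray(x, dtype=float)
    above = x > np.median(x)                      # dichotomize the sequence
    n1, n2 = int(above.sum()), int((~above).sum())
    if n1 == 0 or n2 == 0:                        # degenerate sequence: no test
        return 1.0
    runs = 1 + int(np.sum(above[1:] != above[:-1]))   # observed number of runs
    mean = 2.0 * n1 * n2 / (n1 + n2) + 1.0            # expected runs under randomness
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1.0)))
    z = (runs - mean) / np.sqrt(var)
    return 2.0 * norm.sf(abs(z))
```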

The remaining part of this paper is organized as follows. In Section 2, we describe the SVD-based probit transformation and illustrate it using two motivating examples. In Section 3, we test the performance of the transformation on three simulated examples. In Section 4, we apply the transformation to a real data example. In Section 5, we conclude the paper with a brief discussion.

Section snippets

SVD-based probit transformation

Principal component analysis, or its computational equivalent, the SVD, has long been considered a useful tool for reducing the dimensionality of data prior to clustering; see, for example, Jolliffe (1986). Recently, this tool has been applied to gene expression data (Holter et al., 2000; Alter et al., 2000; Yeung and Ruzzo, 2001; Hastie et al., 2000; Horn and Axel, 2003).

Let X denote an n×p matrix, which represents a data set of n genes with each being measured at p discrete
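
The snippet above is truncated, so the exact form of the transformation is not shown here. The following is a minimal sketch of one plausible reading: each gene profile is centered and scaled to unit norm, the matrix is decomposed by SVD, the coordinates of each gene on the top q right singular vectors (which then lie in [-1, 1] and can be read as correlation-like coefficients) are mapped into (0, 1), and the probit function, the standard normal quantile, is applied. The unit-norm scaling, the (r + 1)/2 mapping, the clipping constant and the choice of q are illustrative assumptions, not the paper's formulas.

```python
# Hedged sketch of an SVD-based probit transformation (assumptions noted above).
import numpy as np
from scipy.stats import norm

def svd_probit_transform(X, q=2, eps=1e-6):
    """X: n genes x p conditions. Returns an n x q matrix of probit scores."""
    Xs = X - X.mean(axis=1, keepdims=True)            # center each gene profile
    norms = np.linalg.norm(Xs, axis=1, keepdims=True)
    Xs = Xs / np.where(norms > 0, norms, 1.0)         # unit-norm rows (guard constant genes)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    R = Xs @ Vt[:q].T                                 # coefficients in [-1, 1], one per eigen-direction
    P = np.clip((R + 1.0) / 2.0, eps, 1.0 - eps)      # map [-1, 1] into (0, 1)
    return norm.ppf(P)                                # probit (standard normal quantile)

# The transformed scores can then be clustered in the reduced q-dimensional
# space by a mixture-Gaussian model or a model-free method (K-means, SOM, ...).
```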

Example I

Yeung and Ruzzo (2001) considered a simulated example, which models the cyclic behavior of genes over different time points using the sine function. The sine-function modeling of cell-cycle behavior is supported by the experiments reported by Holter et al. (2000) and Alter et al. (2000). Genes in the same cluster have similar peak times over the time course. Different clusters have different phase shifts and different sizes. This example is the same as Yeung and Ruzzo's example except
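
Because the snippet is truncated, the exact simulation settings are not reproduced here. The sketch below only illustrates a sine-function design in the spirit of Yeung and Ruzzo (2001), with one phase shift per cluster and gene-specific amplitudes; all constants (number of clusters, cluster sizes, number of time points, noise level) are illustrative assumptions.

```python
# Hedged sketch of a sine-function cluster simulation; constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, genes_per_cluster, p = 4, 50, 17            # assumed design sizes
t = np.arange(p)
phases = rng.uniform(0, 2 * np.pi, size=n_clusters)     # one phase shift per cluster

profiles, labels = [], []
for k, phi in enumerate(phases):
    amp = rng.uniform(0.5, 2.0, size=(genes_per_cluster, 1))  # gene-specific amplitudes
    signal = np.sin(2 * np.pi * t / p + phi)                  # shared cyclic pattern
    noise = rng.normal(scale=0.3, size=(genes_per_cluster, p))
    profiles.append(amp * signal + noise)
    labels.extend([k] * genes_per_cluster)

X = np.vstack(profiles)     # simulated expression matrix, genes x time points
labels = np.array(labels)   # true memberships for checking a clustering result
```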

A real data example

The fibroblast data set was collected by Iyer et al. (1999) for the purpose of investigating the response of human fibroblasts to serum after growth arrest. In the study, the temporal changes in mRNA levels of 8613 genes were measured at 12 time points, ranging from 15 min to 24 h after serum stimulation. The full data set is available at http://genome-www.stanford.edu/serunm/data.html. The genes whose expression levels changed substantially in response to serum were extracted to form a subset of

Conclusion

In this paper, we have proposed the SVD-based probit transformation to improve the performance of the MG method for clustering gene expression profiles. Our numerical results show that the transformation is generally useful for data sets both with and without scattered genes. As a by-product, we show that the probit transformation also improves the performance of the model-free clustering methods, such as SOM, AHC and K-means, for data sets containing scattered genes.

We

Acknowledgments

The author's research was partially supported by grants from the National Science Foundation (DMS-0405748) and the National Cancer Institute (CA104620).

References (33)

  • C. Biernacki et al., 1999. An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Lett.
  • R.J. Cho et al., 1998. A genome wide transcriptional analysis of the mitotic cell cycle. Mol. Cell.
  • O. Alter et al., 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA.
  • J.D. Banfield et al., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics.
  • D.B. Carr et al., 1997. Templates for looking at gene expression clustering. Statist. Comput. Statist. Graphics Newslett.
  • W.C. Chang, 1983. On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist.
  • G. Chen et al., 2002. Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statist. Sinica.
  • D.L. Davies et al., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell.
  • A.P. Dempster et al., 1977. Maximum likelihood for incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc., Ser. B.
  • M.B. Eisen et al., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA.
  • C. Fraley et al., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc.
  • T. Hastie et al., 2000. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol.
  • N.S. Holter et al., 2000. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl. Acad. Sci. USA.
  • D. Horn et al., 2003. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics.
  • L. Hubert, P. Arabie, 1985. Comparing partitions. J. Classification, ...
  • V.R. Iyer et al., 1999. The transcriptional program in the response of human fibroblasts to serum. Science.