Use of SVD-based probit transformation in clustering gene expression profiles

https://doi.org/10.1016/j.csda.2007.01.022

Abstract

The mixture-Gaussian model-based clustering method has received much attention in the bioinformatics literature on clustering gene expression profiles. However, the method suffers from two difficulties in applications. The first concerns parameter estimation, which becomes difficult when the dimension of the data is high or the size of a cluster is small. The second concerns the normality assumption for gene expression levels, which is seldom satisfied by real data. In this paper, we propose to overcome these two difficulties by the probit transformation in conjunction with the singular value decomposition (SVD). SVD reduces the dimensionality of the data, and the probit transformation converts the scaled eigensamples, which can be interpreted as correlation coefficients as explained in the text, into Gaussian random variables. Our numerical results show that the SVD-based probit transformation enhances the ability of the mixture-Gaussian model-based clustering method to identify prominent patterns in the data. As a by-product, we show that the SVD-based probit transformation also improves the performance of model-free clustering methods, such as hierarchical clustering, K-means and self-organizing maps (SOM), on data sets containing scattered genes. We also propose a run test-based rule for selecting the eigensamples used for clustering.

Introduction

In attempts to understand biological systems, large amounts of gene expression data have been generated by researchers. Due to the large number of genes and the complexity of biological systems, clustering has become one of the most important exploratory tools for analyzing these data. Clustering identifies groups of genes that exhibit similar expression profiles. The popular clustering methods can be divided into two categories, namely, model-free methods and model-based methods. In the model-free clustering methods, no probabilistic model is specified for the data, and clustering is performed either by optimizing a certain target function or by iteratively agglomerating/dividing genes to form a bottom-up/top-down tree. Examples include K-means (Tavazoie et al., 1999; Tou and Gonzalez, 1979), hierarchical clustering (Carr et al., 1997; Eisen et al., 1998), and self-organizing maps (Tamayo et al., 1999), among others. The model-based clustering methods construct clusters under the assumption that the data follow a mixture distribution. A non-exhaustive list of recent works in this direction includes Banfield and Raftery (1993), Biernacki et al. (1999), Fraley and Raftery (2002), Yeung et al. (2001), Medvedovic and Sivaganesan (2002), McLachlan et al. (2002), Wakefield et al. (2003) and Medvedovic et al. (2004). One advantage of the model-based clustering methods is that the probability model provides criteria for choosing an appropriate number of clusters.

Among the model-based clustering methods, the one based on the mixture-Gaussian distribution is of particular interest due to its computational simplicity. Henceforth, the mixture-Gaussian model-based clustering method will be abbreviated as the MG method. The MG method assumes that the observations $x_1,\ldots,x_n$ are generated from a mixture-Gaussian distribution with an unknown number of components. The corresponding likelihood function is
$$L(x_1,\ldots,x_n \mid \omega_k,\mu_k,\Sigma_k,\ k=1,\ldots,G)=\prod_{i=1}^{n}\sum_{k=1}^{G}\frac{\omega_k}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left\{-\tfrac{1}{2}(x_i-\mu_k)^{T}\Sigma_k^{-1}(x_i-\mu_k)\right\},$$
where $G$ is the (unknown) number of components, $p$ is the dimension of the observations, $\omega_k$ is the probability that an observation belongs to component $k$ ($\omega_k\ge 0$ and $\sum_{k=1}^{G}\omega_k=1$), and $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of component $k$, respectively. Banfield and Raftery (1993) proposed to reparametrize the covariance matrices by the eigenvalue decomposition
$$\Sigma_k=\lambda_k D_k A_k D_k^{T},$$
where $\lambda_k=|\Sigma_k|^{1/p}$, $D_k$ is the matrix of eigenvectors of $\Sigma_k$, and $A_k$ is a diagonal matrix with $|A_k|=1$. The parameter $\lambda_k$ determines the volume of component $k$, $D_k$ determines its orientation, and $A_k$ its shape. Allowing some but not all of these quantities to vary between components yields a set of parsimonious models appropriate for various clustering situations. Fraley and Raftery (2002) considered 10 different models corresponding to different assumptions on the component covariance matrices. Each model is denoted by a string of three letters, E (equal), V (variable) and I (identity). The first letter states the assumption on the volumes of the clusters, the second on the shapes, and the third on the orientations. For example, VEI represents a model in which the volumes of the clusters may vary (V), the shapes of all clusters are equal (E), and the orientation matrix is the identity (I). The MG method is implemented in the software MCLUST, which is downloadable at http://www.stat.washington.edu/mclust. In MCLUST, the model parameters are estimated using the EM algorithm (Dempster et al., 1977), and the BIC criterion is adopted for determining the number of clusters and the covariance structure.
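
As a rough illustration of this model-selection step, the sketch below fits mixture-Gaussian models over a grid of component counts and covariance structures and keeps the fit with the lowest BIC. It is not the paper's MCLUST pipeline: scikit-learn's GaussianMixture offers only the "full", "tied", "diag" and "spherical" covariance structures rather than MCLUST's ten parametrizations, and its BIC is defined so that smaller values are preferred.

```python
# Minimal sketch of BIC-based selection of a mixture-Gaussian model.
# Not the MCLUST implementation used in the paper; see the note above.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mg_by_bic(X, max_components=10,
                  covariance_types=("full", "tied", "diag", "spherical")):
    """Return the fitted GaussianMixture with the lowest BIC on X (n x p)."""
    best_model, best_bic = None, np.inf
    for cov in covariance_types:
        for g in range(1, max_components + 1):
            gm = GaussianMixture(n_components=g, covariance_type=cov,
                                 n_init=5, random_state=0).fit(X)
            bic = gm.bic(X)  # scikit-learn's BIC: smaller is better
            if bic < best_bic:
                best_model, best_bic = gm, bic
    return best_model, best_bic

# Example usage on synthetic data:
# X = np.random.randn(300, 4)
# model, _ = fit_mg_by_bic(X)
# labels = model.predict(X)  # hard cluster assignments
```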

Although the MG method has achieved great success in clustering gene expression profiles (Yeung et al., 2001), its applications may be seriously hindered by the following two difficulties. The first concerns parameter estimation, which becomes difficult when the dimension of the data is high or the size of a cluster is small. The second concerns the validity of the normality assumption, which is seldom satisfied by real data. Applying the MG method to a data set whose distribution deviates from Gaussian will result in a sub-optimal clustering.

In this paper, we propose to overcome the above two difficulties by an SVD-based probit transformation. SVD reduces the dimension of the observations, and the probit transformation converts the scaled eigensamples, which can be interpreted as correlation coefficients as explained below, into Gaussian random variables. Our numerical results show that the transformation enhances the ability of the MG method to identify prominent patterns in the data. We also propose a run test-based rule for selecting the eigensamples used for clustering. The new rule works well for all examples studied in this paper. Although the main theme of this paper is to show that the SVD-based probit transformation generally improves the performance of the MG method in clustering gene expression profiles, as a by-product we show that the transformation also improves the performance of the model-free clustering methods for data sets containing scattered genes (Tseng and Wong, 2005).
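
The run test-based rule itself is described later in the paper and is not reproduced here. As a hedged sketch, the function below computes a Wald-Wolfowitz runs test with the usual normal approximation, which is the kind of randomness check such a rule can build on; which sequence is tested and how the resulting p-value enters the selection decision are left open, since those details belong to the paper.

```python
# Hedged sketch of a Wald-Wolfowitz runs test (normal approximation).
# This is NOT the paper's eigensample-selection rule, only the generic test.
import numpy as np
from scipy.stats import norm

def runs_test_pvalue(x):
    """Two-sided p-value of a runs test on the signs of x about its median."""
    x = np.asarray(x, dtype=float)
    above = x > np.median(x)                      # dichotomize the sequence
    n1, n2 = int(above.sum()), int((~above).sum())
    if n1 == 0 or n2 == 0:                        # degenerate sequence: no test
        return 1.0
    runs = 1 + int(np.sum(above[1:] != above[:-1]))   # observed number of runs
    mean = 2.0 * n1 * n2 / (n1 + n2) + 1.0            # expected runs under randomness
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1.0)))
    z = (runs - mean) / np.sqrt(var)
    return 2.0 * norm.sf(abs(z))
```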

The remaining part of this paper is organized as follows. In Section 2, we describe the SVD-based probit transformation and illustrate it using two motivating examples. In Section 3, we test the performance of the transformation on three simulated examples. In Section 4, we apply the transformation to a real data example. In Section 5, we conclude the paper with a brief discussion.

Section snippets

SVD-based probit transformation

Principal component analysis, or its computational equivalent, the SVD, has long been considered a useful tool for reducing the dimensionality of data prior to clustering; see, for example, Jolliffe (1986). Recently, this tool has been applied to gene expression data (Holter et al., 2000; Alter et al., 2000; Yeung and Ruzzo, 2001; Hastie et al., 2000; Horn and Axel, 2003).

Let X denote an n×p matrix, which represents a data set of n genes with each being measured at p discrete
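
The snippet above is truncated, so the exact form of the transformation is not shown here. The following is a minimal sketch of one plausible reading: each gene profile is centered and scaled to unit norm, the matrix is decomposed by SVD, the coordinates of each gene on the top q right singular vectors (which then lie in [-1, 1] and can be read as correlation-like coefficients) are mapped into (0, 1), and the probit function, the standard normal quantile, is applied. The unit-norm scaling, the (r + 1)/2 mapping, the clipping constant and the choice of q are illustrative assumptions, not the paper's formulas.

```python
# Hedged sketch of an SVD-based probit transformation (assumptions noted above).
import numpy as np
from scipy.stats import norm

def svd_probit_transform(X, q=2, eps=1e-6):
    """X: n genes x p conditions. Returns an n x q matrix of probit scores."""
    Xs = X - X.mean(axis=1, keepdims=True)            # center each gene profile
    norms = np.linalg.norm(Xs, axis=1, keepdims=True)
    Xs = Xs / np.where(norms > 0, norms, 1.0)         # unit-norm rows (guard constant genes)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    R = Xs @ Vt[:q].T                                 # coefficients in [-1, 1], one per eigen-direction
    P = np.clip((R + 1.0) / 2.0, eps, 1.0 - eps)      # map [-1, 1] into (0, 1)
    return norm.ppf(P)                                # probit (standard normal quantile)

# The transformed scores can then be clustered in the reduced q-dimensional
# space by a mixture-Gaussian model or a model-free method (K-means, SOM, ...).
```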

Example I

Yeung and Ruzzo (2001) considered a simulated example, which models the cyclic behavior of genes over different time points using the sine function. The sine-function modeling of cell-cycle behavior is supported by the experiments reported by Holter et al. (2000) and Alter et al. (2000). Genes in the same cluster have similar peak times over the time course. Different clusters have different phase shifts and different sizes. This example is the same as Yeung and Ruzzo's example except
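
Because the snippet is truncated, the exact simulation settings are not reproduced here. The sketch below only illustrates a sine-function design in the spirit of Yeung and Ruzzo (2001), with one phase shift per cluster and gene-specific amplitudes; all constants (number of clusters, cluster sizes, number of time points, noise level) are illustrative assumptions.

```python
# Hedged sketch of a sine-function cluster simulation; constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, genes_per_cluster, p = 4, 50, 17            # assumed design sizes
t = np.arange(p)
phases = rng.uniform(0, 2 * np.pi, size=n_clusters)     # one phase shift per cluster

profiles, labels = [], []
for k, phi in enumerate(phases):
    amp = rng.uniform(0.5, 2.0, size=(genes_per_cluster, 1))  # gene-specific amplitudes
    signal = np.sin(2 * np.pi * t / p + phi)                  # shared cyclic pattern
    noise = rng.normal(scale=0.3, size=(genes_per_cluster, p))
    profiles.append(amp * signal + noise)
    labels.extend([k] * genes_per_cluster)

X = np.vstack(profiles)     # simulated expression matrix, genes x time points
labels = np.array(labels)   # true memberships for checking a clustering result
```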

A real data example

The fibroblast data set was collected by Iyer et al. (1999) for the purpose of investigating the response of human fibroblasts to serum after growth arrest. In the study, the temporal changes in mRNA levels of 8613 genes were measured at 12 time points, ranging from 15 min to 24 h after serum stimulation. The full data set is available at http://genome-www.stanford.edu/serunm/data.html. The genes whose expression levels changed substantially in response to serum were extracted to form a subset of

Conclusion

In this paper, we have proposed the SVD-based probit transformation to improve the performance of the MG method for clustering gene expression profiles. Our numerical results show that the transformation is generally useful for data sets both with and without scattered genes. As a by-product, we show that the probit transformation also improves the performance of the model-free clustering methods, such as SOM, AHC and K-means, for data sets containing scattered genes.

We

Acknowledgments

The author's research was partially supported by grants from the National Science Foundation (DMS-0405748) and the National Cancer Institute (CA104620).

References (33)

  • C. Biernacki et al., 1999. An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Lett.
  • R.J. Cho et al., 1998. A genome wide transcriptional analysis of the mitotic cell cycle. Mol. Cell.
  • O. Alter et al., 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA.
  • J.D. Banfield et al., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics.
  • D.B. Carr et al., 1997. Templates for looking at gene expression clustering. Statist. Comput. Statist. Graphics Newslett.
  • W.C. Chang, 1983. On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist.
  • G. Chen et al., 2002. Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statist. Sinica.
  • D.L. Davies et al., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell.
  • A.P. Dempster et al., 1977. Maximum likelihood for incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc., Ser. B.
  • M.B. Eisen et al., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA.
  • C. Fraley et al., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc.
  • T. Hastie et al., 2000. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol.
  • N.S. Holter et al., 2000. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl. Acad. Sci. USA.
  • D. Horn et al., 2003. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics.
  • L. Hubert, P. Arabie, 1985. Comparing partitions. J. Classification, ...
  • V.R. Iyer et al., 1999. The transcriptional program in the response of human fibroblasts to serum. Science.