A novel HMM-based clustering algorithm for the analysis of gene expression time-course data

https://doi.org/10.1016/j.csda.2005.07.007

Abstract

A novel hidden Markov model (HMM) and clustering algorithm for the analysis of gene expression time-course data is proposed. The proposed model, called the profile-HMM, is specifically designed to explicitly take into account the dynamic nature of temporal gene expression profiles, which is ignored by many clustering methods existing in the literature. In this model, gene expression dynamics are represented by a special set of paths, with each path characterizing a stochastic pattern. The profile-HMM is trained to contain the most likely set of stochastic patterns given the dynamic microarray data, and the clustering result is obtained by grouping together the time-series that are most likely to be related to the same pattern. The novelty of the method is that the behavior of the whole gene expression dataset is modeled by a single HMM acting as a self-organizing map, so that all the clusters are implicitly and jointly defined in the model during the training phase. An attractive property of the profile-HMM clustering algorithm is its ability to automatically identify the number of clusters. The resulting performance is demonstrated by its application on simulated and biological data.

Introduction

This paper addresses the problem of time-course data clustering, which is motivated by the increasing importance of temporal profiles in the context of gene expression data analysis. These temporal profiles, also referred to as gene expression time-course data, provide insights about how biological systems evolve in time and about how genes are co-regulated during a specific biological process. Therefore, it is well accepted that similar expression profiles potentially indicate related functions. This similarity can be explored by clustering analysis, which groups genes according to their expression time-course data. This technique has been widely applied in this field.

Early studies on clustering of gene expression time-course data focused on the implementation of some popular distance-based clustering methods, such as K-means clustering (Tavazoie et al., 1999), hierarchical clustering (Spellman et al., 1998), and self-organizing maps (SOM) (Tamayo et al., 1999). Although these traditional methods produce meaningful results for some datasets, none of them is able to take into account the dependences existing between observations belonging to subsequent time-points (e.g., the results obtained from these methods are invariant to arbitrary permutations of the time-points). This temporal dependence is an important feature of time-course data, and it is intuitive that its exploitation in the clustering process should lead to higher quality results.
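The permutation invariance mentioned above can be checked directly. The following is a minimal sketch, assuming a plain Euclidean distance and two made-up expression profiles (the values and the names `g1`, `g2`, `order` are illustrative and do not come from the paper): shuffling the time-points of both profiles in the same way leaves their distance unchanged, so any clustering built only on such distances cannot see the temporal ordering.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def permute(series, order):
    """Reorder the time-points of a profile according to `order`."""
    return [series[i] for i in order]

# Two hypothetical expression profiles over five time-points.
g1 = [0.0, 0.5, 1.0, 0.5, 0.0]
g2 = [0.1, 0.4, 0.9, 0.7, 0.2]

# An arbitrary reordering of the time-points, applied to both profiles.
order = [3, 0, 4, 1, 2]

d_original = euclidean(g1, g2)
d_shuffled = euclidean(permute(g1, order), permute(g2, order))

# The two distances agree up to floating-point rounding.
print(math.isclose(d_original, d_shuffled))
```

Because the distance is a sum over time-points, the order in which the terms are added does not matter, which is exactly why the temporal structure is lost.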

The problem of capturing the dynamic patterns in the particular case of gene expression time-course data has been recently addressed by several authors (Bar-Joseph et al., 2002, Luan and Li, 2003, Moller-Levet et al., 2003). The common idea in these studies is to represent the gene expression time-series as continuous or piecewise continuous curves, and then to perform clustering based on the estimated curves. The fuzzy short time-series (FSTS) clustering method proposed by Moller-Levet et al. (2003) is based on the incorporation of a new distance metric, which utilizes a piecewise linear model, into a standard fuzzy clustering scheme. This method is simple and fast, but its underlying linear assumption may be an oversimplification in the type of problems encountered in real biological applications. More flexible models, such as spline models, have been applied in this context (Luan and Li, 2003, Moller-Levet et al., 2003). However, in order to produce reliable results, they require the data to be sampled at a sufficiently high rate.

Another popular way to exploit time dependences is the use of hidden Markov models (HMMs). An HMM can be viewed as a stochastic generalization of a finite-state automaton, and it provides a probabilistic description of temporal dependences. Although HMMs have been widely used in many fields, such as speech recognition and digital communications, their application to the clustering of temporal gene expression profiles has not been widespread. One line of work utilizes HMMs to devise model-based metrics for time-series. The idea is to generate an HMM for each sequence, and then to compute the log-likelihood (LL) of each HMM for any of the sequences. This information is used to build a matrix of distances between sequences, and then the data is clustered by applying a distance-based clustering method employing such a matrix (Bicego et al., 2003, Smyth, 1997). Alternatively, the LLs can be directly utilized as features for later clustering processing (Panuccio et al., 2002). However, there are several limitations in the LL calculation, which may degrade the performance of the whole clustering analysis. First, in order to calculate the LLs, one HMM is trained for each sequence. Since reliable training requires long sequences, the reliability of the clustering result may be heavily degraded when dealing with short sequences of gene expression data. Second, since each HMM is trained separately and independently, the model lacks a global view on the overall distribution of the patterns in the data. Finally, this technique assumes that for each gene the transitions between neighboring temporal observations follow the same stationary stochastic process. However, this time-invariance assumption does not usually hold in microarray data, especially when the expression measures are taken non-uniformly in time.
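To make the LL-based distance idea concrete, here is a small sketch. It assumes the per-sequence HMMs have already been trained and that `ll[i][j]` holds the log-likelihood of sequence j under the HMM trained on sequence i. The symmetrization used here is only one plausible choice, picked for illustration; it is not claimed to be the exact formula of the cited works, and the numbers are invented.

```python
def symmetric_ll_distance(ll):
    """Turn a matrix of log-likelihoods into a symmetric distance matrix.

    ll[i][j] is the log-likelihood of sequence j under the HMM that was
    trained on sequence i (assumed to be computed elsewhere).
    """
    n = len(ll)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Average loss in fit when each sequence is scored by the
            # other sequence's model instead of its own.
            dist[i][j] = 0.5 * ((ll[i][i] - ll[i][j]) + (ll[j][j] - ll[j][i]))
    return dist

# Hypothetical log-likelihoods for three sequences (illustrative values only).
ll = [
    [-4.0, -5.0, -12.0],
    [-5.5, -4.5, -11.0],
    [-13.0, -12.5, -3.5],
]

dist = symmetric_ll_distance(ll)
for row in dist:
    print(row)
```

The resulting matrix can be passed to any distance-based clustering method. With these made-up values, sequences 0 and 1 end up much closer to each other than either is to sequence 2.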

The first two aforementioned limitations have been addressed by Smyth (1997) by constructing a mixture of K HMMs, where each component HMM represents one cluster. Each HMM is built on a separate set of states, and no transition is allowed between states from different component HMMs. The model initialization requires a K-cluster partition, so that each component is trained by one cluster. Then, the expectation-maximization (EM) algorithm is applied to the overall model to retrain the parameters of all components using the whole observation data. Obviously, the performance of this method is dependent on the availability of a good prediction for the number of clusters, K, which was estimated by Smyth (1997) utilizing Monte Carlo cross-validation techniques. Extensions of this work can be found in Ji et al. (2003) and Schliep et al. (2003). Another related but different approach was presented by Ramoni et al. (2002), who developed a software tool called CAGED (clustering analysis of gene expression data). In this software, the gene expression time-course data is represented by an autoregressive (AR) HMM and the clustering result is obtained using an agglomerative hierarchical method in a Bayesian framework. The number of clusters in the final result is decided on the basis of the posterior probability of the model, and each cluster is represented by an AR model trained over its corresponding time-series.

Although the work mentioned in the previous paragraph addresses the first two limitations existing in LL-based approaches, it still keeps the restrictive assumption of time-invariance: the general HMMs in Ji et al. (2003), Schliep et al. (2003) and Smyth (1997) utilize the same group of parameters to describe the temporal dependences in different time intervals, while in the CAGED software the autoregressive coefficients within a given model are still assumed fixed through all the time intervals. This questionable assumption may degrade the clustering performance, especially when dealing with irregularly sampled time-course data. Moreover, because of this assumption, it may be difficult to interpret the final model. Besides, although these methods consider the whole dataset during the training stage, the “communication” between the component HMMs representing each cluster is very limited: in Ji et al. (2003), Schliep et al. (2003) and Smyth (1997), “crossover” is not allowed among the component HMMs, while in CAGED each one of the AR models (which represents a cluster) is trained only by the time-series associated with its corresponding cluster.

In this paper, we introduce a new HMM-based clustering method for the analysis of gene expression time-series data. The proposed HMM model, which we call the profile-HMM, provides a natural description for multiple dependent stochastic patterns, and explicitly takes into account the dynamic nature of temporal gene expression profiles. Based on this model, the similarity between two time-series is defined according to the probability that they are related to the same stochastic pattern, and the training task is to find the most probable set of patterns characterizing the observed time-series. Finally, the clustering result is obtained by grouping together the time-series that are most likely to be related to the same pattern. It is important to remark that the training and clustering procedures in the profile-HMM are nothing more than standard techniques commonly applied in HMMs, which allows an easy implementation. In particular, the profile-HMM is trained by the Baum–Welch algorithm (Rabiner, 1989), and, once the model has been trained, clustering is performed utilizing the Viterbi algorithm (Rabiner, 1989). Another important feature of the profile-HMM is its ability to automatically identify the number of clusters contained in the dataset.
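The idea of clustering by Viterbi decoding can be illustrated with a generic toy HMM. The sketch below is not the profile-HMM itself: the two states, the Gaussian emissions with a shared standard deviation, and all parameter values are assumptions made for illustration, and the parameters are set by hand rather than trained with Baum–Welch. It only shows the final step, in which each series is decoded to its most likely state path and series sharing a path are grouped together.

```python
import math
from collections import defaultdict

def log_gauss(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def viterbi(obs, states, log_start, log_trans, means, std):
    """Return the most likely hidden-state path for one observed series."""
    score = {s: log_start[s] + log_gauss(obs[0], means[s], std) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            best_prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + log_trans[best_prev][s]
                            + log_gauss(o, means[s], std))
        score = new_score
        backpointers.append(pointer)
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return tuple(reversed(path))

# Hand-set toy parameters (assumed, not trained).
states = ("low", "high")
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "low": {"low": math.log(0.8), "high": math.log(0.2)},
    "high": {"low": math.log(0.2), "high": math.log(0.8)},
}
means = {"low": 0.0, "high": 1.0}

# Three hypothetical short time-series.
series = [
    [0.0, 0.1, 0.9, 1.0],
    [0.05, 0.0, 1.1, 0.95],
    [1.0, 0.9, 0.1, 0.0],
]

# Group the series by their decoded path: one path, one cluster.
clusters = defaultdict(list)
for idx, obs in enumerate(series):
    path = viterbi(obs, states, log_start, log_trans, means, std=0.3)
    clusters[path].append(idx)

for path, members in clusters.items():
    print(path, members)
```

In this toy setting the first two series follow a rising pattern and the third a falling one, so they fall into two groups. Note that the number of groups is not fixed in advance but follows from how many distinct paths are actually used.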

The main novelty of the profile-HMM approach with respect to previous work is the way in which it represents the stochastic patterns hidden in the observed time-series by using a single left–right HMM. This novel structure allows an easy connection between the elements in the model and the real application, since the patterns hidden in the data are characterized in a natural way. Moreover, the use of a single HMM forces the representative patterns to be selected simultaneously. Compared with other approaches in which a different HMM is created for each cluster, an advantage of the profile-HMM is that each stochastic pattern (and consequently each cluster) is built according to both negative samples (i.e., profiles related to other patterns) and positive samples (profiles related to this pattern). Another important difference with respect to previous work is that the time dependences at different time intervals are modeled separately, which relaxes the time-invariance assumption and makes the profile-HMM more general and flexible to describe the microarray data.

The rest of this paper is organized as follows. In Section 2, we introduce the definition of the profile-HMM and describe the details of the proposed clustering algorithm. The performance of the proposed method is evaluated using simulation data in Section 3, where we carry out a comparison study with other clustering techniques. We also applied the profile-HMM to real gene expression time-course data collected for the study of real biological problems. Section 4 analyzes and discusses the results obtained in these applications. Conclusions are provided in Section 5.

Section snippets

Hidden Markov models

An HMM is formally defined by the following elements (Rabiner, 1989):

  • $S = \{s_1, \ldots, s_N\}$, the finite set of possible (hidden) states.

  • $A = \{a_{ij}\}$, $1 \le i, j \le N$, the transition matrix, which describes the transition probability distribution among the associated states. In this matrix, the entry $a_{ij}$ represents the probability of moving from state $s_i$ to $s_j$: $a_{ij} = \Pr(q_{t+1} = s_j \mid q_t = s_i)$, $1 \le i, j \le N$.

  • $B = \{b_j(o)\}$, $1 \le j \le N$, the emission matrix, which indicates the probability of emission of observation $o \in \mathbb{R}$ when the system state is $s_j$: $b_j(o) = \Pr(o \text{ at } \ldots$
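As a rough illustration of how the elements listed above fit together, the following sketch evaluates the log-likelihood of a short real-valued series with the forward algorithm. Several things here are assumptions rather than content of this excerpt: the initial state distribution `pi`, the choice of Gaussian emission densities for the b_j(o), and all numeric values. All entries of `A` and `pi` are kept strictly positive so that their logarithms are defined.

```python
import math

def log_gauss(x, mean, std):
    """Log-density of a univariate Gaussian, used here as b_j(o)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(obs, pi, A, means, stds):
    """Forward algorithm in the log domain.

    pi[j]   : assumed initial probability of state s_j
    A[i][j] : transition probability a_ij from s_i to s_j
    means, stds : parameters of the assumed Gaussian emissions
    """
    n = len(pi)
    alpha = [math.log(pi[j]) + log_gauss(obs[0], means[j], stds[j])
             for j in range(n)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + math.log(A[i][j]) for i in range(n)])
            + log_gauss(o, means[j], stds[j])
            for j in range(n)
        ]
    return logsumexp(alpha)

# A hypothetical two-state model.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
means = [0.0, 1.0]
stds = [0.5, 0.5]

print(log_likelihood([0.1, 0.2, 0.9], pi, A, means, stds))
```

Each row of `A` sums to one because it is a probability distribution over the next state. With a single state the computation reduces to a sum of independent Gaussian log-densities, which gives a quick sanity check.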

Simulation data

In order to investigate the performance of the profile-HMM method, and to compare it with that of other clustering techniques, we consider first a simple simulation-based example for which the correct results are known. The simulation dataset is a collection of 550 time-series, each of length T=5, generated from the five basic sequences shown in Fig. 2. In order to produce similar but distinct time-series (grouped in as many clusters as basic sequences), a continuous noise generated using

Sporulation dataset

This dataset was collected to study the transcriptional program of sporulation in budding yeast (Chu et al., 1998). In this study, DNA microarrays, containing 97% of the known or predicted genes of Saccharomyces cerevisiae, are used to explore the temporal program of gene expression during meiosis and spore formation. Changes in the concentrations of mRNA transcripts from each gene were measured at seven uneven time intervals, and the sampling times were chosen based on the expression patterns

Conclusion and future work

This paper proposes a hidden Markov model-based clustering algorithm, which is able to efficiently account for the time dependences existing in time-course gene expression data. The proposed model, the profile-HMM, provides a natural description for multiple dependent stochastic patterns, where each pattern is represented as a path in the proposed model. This method shows a self-organizing nature in both its performance and its structure, and is able to automatically decide the number of

References (22)

  • X. Ji et al. Mining gene expression data using a novel approach based on hidden Markov models. FEBS Lett. (2003)
  • Bar-Joseph, Z., Gerber, G., Gifford, D.K., Jaakkola, T.S., 2002. A new approach to analyzing gene expression time...
  • Bicego, M., Murino, V., Figueiredo, M., 2003. Similarity-based clustering of sequences using hidden Markov models. MLDM...
  • S. Chu et al. The transcriptional program of sporulation in budding yeast. Science (1998)
  • D.L. Davies et al. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. (1979)
  • M.B. Eisen et al. Clustering analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998)
  • Fraley, C., Raftery, A., How many clusters? Which clustering method? Answers via model-based cluster analysis....
  • A. Gordon. Classification (1999)
  • M. Halkidi et al. On clustering validation techniques. J. Intell. Information Syst. (2001)
  • V.R. Iyer et al. The transcriptional program in the response of human fibroblasts to serum. Science (1999)
  • A.K. Jain et al. Algorithms for Clustering Data (1988)