A novel HMM-based clustering algorithm for the analysis of gene expression time-course data

doi:10.1016/j.csda.2005.07.007

Computational Statistics & Data Analysis

Volume 50, Issue 9, 1 May 2006, Pages 2472-2494

https://doi.org/10.1016/j.csda.2005.07.007

Abstract

A novel hidden Markov model (HMM) and clustering algorithm for the analysis of gene expression time-course data are proposed. The proposed model, called the profile-HMM, is specifically designed to take explicit account of the dynamic nature of temporal gene expression profiles, which is ignored by many clustering methods in the literature. In this model, gene expression dynamics are represented by a special set of paths, with each path characterizing a stochastic pattern. The profile-HMM is trained to contain the most likely set of stochastic patterns given the dynamic microarray data, and the clustering result is obtained by grouping together the time-series that are most likely to be related to the same pattern. The novelty of the method is that the behavior of the whole gene expression dataset is modeled by a single HMM acting as a self-organizing map, so that all the clusters are implicitly and jointly defined in the model during the training phase. An attractive property of the profile-HMM clustering algorithm is its ability to automatically identify the number of clusters. The resulting performance is demonstrated by its application to simulated and biological data.

Introduction

This paper addresses the problem of time-course data clustering, which is motivated by the increasing importance of temporal profiles in the context of gene expression data analysis. These temporal profiles, also referred to as gene expression time-course data, provide insights into how biological systems evolve in time and how genes are co-regulated during a specific biological process. It is widely accepted that similar expression profiles may indicate related functions. This similarity can be explored by cluster analysis, which groups genes according to their expression time-course data and has been widely applied in this field.

Early studies on clustering of gene expression time-course data focused on the implementation of some popular distance-based clustering methods, such as K-means clustering (Tavazoie et al., 1999), hierarchical clustering (Spellman et al., 1998), and self-organizing maps (SOM) (Tamayo et al., 1999). Although these traditional methods produce meaningful results for some datasets, none of them is able to take into account the dependences existing between observations belonging to subsequent time-points (e.g., the results obtained from these methods are invariant to arbitrary permutations of the time-points). This temporal dependence is an important feature of time-course data, and it is intuitive that its exploitation in the clustering process should lead to higher quality results.
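The permutation invariance mentioned above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the dissimilarity and two randomly generated profiles (both are illustrative choices, not taken from the cited studies): reordering the time-points of both profiles in the same way leaves their distance unchanged, so a clustering built only on such distances cannot use temporal order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical expression profiles measured at the same 6 time-points.
x = rng.normal(size=6)
y = rng.normal(size=6)

# Apply the same arbitrary reordering of the time-points to both profiles.
perm = rng.permutation(6)

d_original = np.linalg.norm(x - y)
d_shuffled = np.linalg.norm(x[perm] - y[perm])

# The two distances agree up to floating-point rounding.
print(np.isclose(d_original, d_shuffled))
```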

The problem of capturing the dynamic patterns in the particular case of gene expression time-course data has recently been addressed by several authors (Bar-Joseph et al., 2002; Luan and Li, 2003; Moller-Levet et al., 2003). The common idea in these studies is to represent the gene expression time-series as continuous or piecewise continuous curves, and then to perform clustering based on the estimated curves. The fuzzy short time-series (FSTS) clustering method proposed by Moller-Levet et al. (2003) is based on the incorporation of a new distance metric, which utilizes a piecewise linear model, into a standard fuzzy clustering scheme. This method is simple and fast, but its underlying linear assumption may be an oversimplification for the type of problems encountered in real biological applications. More flexible models, such as spline models, have been applied in this context (Luan and Li, 2003; Moller-Levet et al., 2003). However, in order to produce reliable results, they require the data to be sampled at a sufficiently high rate.

Another popular way to exploit time dependences is the use of hidden Markov models (HMMs). An HMM can be viewed as a stochastic generalization of a finite-state automaton, and it provides a probabilistic description of temporal dependences. Although HMMs have been widely used in many fields, such as speech recognition and digital communications, their application to the clustering of temporal gene expression profiles has not been widespread. One line of work utilizes HMMs to devise model-based metrics for time-series. The idea is to generate an HMM for each sequence, and then to compute the log-likelihood (LL) of each HMM for any of the sequences. This information is used to build a matrix of distances between sequences, and then the data is clustered by applying a distance-based clustering method employing such a matrix (Bicego et al., 2003; Smyth, 1997). Alternatively, the LLs can be directly utilized as features for later clustering processing (Panuccio et al., 2002). However, there are several limitations in the calculation of the LLs, which may degrade the performance of the whole clustering analysis. First, in order to calculate the LLs, one HMM is trained for each sequence. Since reliable training requires long sequences, the reliability of the clustering result may be heavily degraded when dealing with short sequences of gene expression data. Second, since each HMM is trained separately and independently, the model lacks a global view of the overall distribution of the patterns in the data. Finally, this technique assumes that for each gene the transitions between neighboring temporal observations follow the same stationary stochastic process. However, this time-invariance assumption does not usually hold in microarray data, especially when the expression measures are taken non-uniformly in time.
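To make the LL-based distance idea concrete, here is a minimal sketch of turning a matrix of cross log-likelihoods into a symmetric distance matrix. The numbers in `LL` are made up for illustration, and the particular symmetrization (average loss in log-likelihood when a sequence is scored by the other sequence's model) is an assumption; the cited papers may define the distance differently.

```python
import numpy as np

# Hypothetical values: LL[i, j] = log Pr(sequence j | HMM trained on sequence i).
LL = np.array([
    [-4.0,  -5.0,  -12.0],
    [-5.5,  -4.2,  -11.0],
    [-13.0, -12.5, -3.8],
])

own = np.diag(LL)  # each sequence scored by its own model

# D[i, j] averages two losses: sequence j scored by model i instead of model j,
# and sequence i scored by model j instead of model i (an assumed definition).
D = 0.5 * ((own[None, :] - LL) + (own[:, None] - LL.T))

print(np.round(D, 2))
```

With these toy values, sequences 0 and 1 end up much closer to each other than either is to sequence 2, and `D` could then be passed to any distance-based clustering method.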

The first two aforementioned limitations have been addressed by Smyth (1997) by constructing a mixture of K HMMs, where each component HMM represents one cluster. Each HMM is built on a separate set of states, and no transition is allowed between states from different component HMMs. The model initialization requires a K-cluster partition, so that each component is trained on one cluster. Then, the expectation-maximization (EM) algorithm is applied to the overall model to retrain the parameters of all components using the whole observation data. The performance of this method clearly depends on the availability of a good prediction for the number of clusters, K, which was estimated by Smyth (1997) utilizing Monte Carlo cross-validation techniques. Extensions of this work can be found in Ji et al. (2003) and Schliep et al. (2003). A related but different approach was presented by Ramoni et al. (2002), who developed a software tool called CAGED (clustering analysis of gene expression data). In this software, the gene expression time-course data is represented by an autoregressive (AR) HMM and the clustering result is obtained using an agglomerative hierarchical method in a Bayesian framework. The number of clusters in the final result is decided on the basis of the posterior probability of the model, and each cluster is represented by an AR model trained over its corresponding time-series.
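The "no transition between component HMMs" constraint can be pictured as a block-diagonal transition matrix over the union of all states. The sketch below assumes K = 2 components with two states each and hand-picked transition probabilities; `block_diagonal` is a hypothetical helper written for this illustration, not code from Smyth (1997).

```python
import numpy as np

def block_diagonal(blocks):
    """Place square matrices along the diagonal; all other entries stay zero."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out

# Hypothetical transition matrices of two component HMMs (one per cluster).
A1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
A2 = np.array([[0.5, 0.5],
               [0.3, 0.7]])

A = block_diagonal([A1, A2])

# Zero off-diagonal blocks: a path that starts in one component never leaves it.
print(A)
```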

Although the work described in the previous paragraph addresses the first two limitations of LL-based approaches, it retains the restrictive assumption of time-invariance: the general HMMs in Ji et al. (2003), Schliep et al. (2003) and Smyth (1997) use the same group of parameters to describe the temporal dependences in different time intervals, while in the CAGED software the autoregressive coefficients within a given model are still assumed fixed through all the time intervals. This questionable assumption may degrade the clustering performance, especially when dealing with irregularly sampled time-course data. Moreover, because of this assumption, it may be difficult to interpret the final model. In addition, although these methods consider the whole dataset during the training stage, the “communication” between the component HMMs representing each cluster is very limited: in Ji et al. (2003), Schliep et al. (2003) and Smyth (1997), “crossover” is not allowed among the component HMMs, while in CAGED each of the AR models (which represents a cluster) is trained only on the time-series associated with its corresponding cluster.

In this paper, we introduce a new HMM-based clustering method for the analysis of gene expression time-series data. The proposed HMM, which we call the profile-HMM, provides a natural description for multiple dependent stochastic patterns, and explicitly takes into account the dynamic nature of temporal gene expression profiles. Based on this model, the similarity between two time-series is defined according to the probability that they are related to the same stochastic pattern, and the training task is to find the most probable set of patterns characterizing the observed time-series. Finally, the clustering result is obtained by grouping together the time-series that are most likely to be related to the same pattern. It is important to remark that the training and clustering procedures in the profile-HMM are nothing more than standard techniques commonly applied in HMMs, which allows an easy implementation. In particular, the profile-HMM is trained by the Baum–Welch algorithm (Rabiner, 1989), and, once the model has been trained, clustering is performed utilizing the Viterbi algorithm (Rabiner, 1989). Another important feature of the profile-HMM is its ability to automatically identify the number of clusters contained in the dataset.
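The clustering step described above (decode each time-series with the Viterbi algorithm, then group series that share a path) can be sketched as follows. This is only an illustration under strong assumptions: a small two-state HMM with Gaussian emissions whose parameters are set by hand rather than learned with Baum–Welch, and a generic state structure rather than the profile-HMM topology, which this excerpt does not specify. All names (`viterbi`, `means`, `sigma`) are hypothetical.

```python
import numpy as np
from collections import defaultdict

def viterbi(obs, log_pi, log_A, means, sigma):
    """Most likely state path for one series under a Gaussian-emission HMM."""
    # log_b[t, j] = log density of obs[t] under state j
    log_b = (-0.5 * ((obs[:, None] - means[None, :]) / sigma) ** 2
             - np.log(sigma * np.sqrt(2 * np.pi)))
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best path into i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers from the last time-point
        path.append(int(back[t, path[-1]]))
    return tuple(reversed(path))

# Hand-set toy model: state 0 = "low" expression, state 1 = "high" expression.
means = np.array([-1.0, 1.0])
sigma = 0.5
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.8, 0.2],
                         [0.2, 0.8]]))

data = [np.array([-1.0, -1.0, 1.0, 1.0]),
        np.array([-0.9, -1.1, 0.8, 1.2]),
        np.array([1.0, 1.0, -1.0, -1.0])]

# Series that decode to the same path are placed in the same cluster.
clusters = defaultdict(list)
for idx, series in enumerate(data):
    clusters[viterbi(series, log_pi, log_A, means, sigma)].append(idx)

print(dict(clusters))
```

In this toy run the first two series follow the same low-then-high path and are grouped together, while the third follows the opposite path. Note that the number of clusters is not fixed in advance here: it is simply the number of distinct decoded paths.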

The main novelty of the profile-HMM approach with respect to previous work is the way in which it represents the stochastic patterns hidden in the observed time-series by using a single left–right HMM. This novel structure allows an easy connection between the elements in the model and the real application, since the patterns hidden in the data are characterized in a natural way. Moreover, the use of a single HMM forces the representative patterns to be selected simultaneously. Compared with other approaches in which a different HMM is created for each cluster, an advantage of the profile-HMM is that each stochastic pattern (and consequently each cluster) is built according to both negative samples (i.e., profiles related to other patterns) and positive samples (profiles related to this pattern). Another important difference with respect to previous work is that the time dependences at different time intervals are modeled separately, which relaxes the time-invariance assumption and makes the profile-HMM more general and more flexible in describing microarray data.
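One way to write the relaxation of time-invariance, using the transition probabilities $a_{ij}$ of a standard HMM and a time superscript that is introduced here only for illustration (it is not notation from the paper):

$$
\text{time-invariant: } \Pr(q_{t+1} = s_{j} \mid q_{t} = s_{i}) = a_{ij} \ \text{ for all } t,
\qquad
\text{interval-specific: } \Pr(q_{t+1} = s_{j} \mid q_{t} = s_{i}) = a_{ij}^{(t)} .
$$

Under the second form, each interval between consecutive time-points has its own transition probabilities, so unevenly spaced measurements need not share one set of parameters.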

The rest of this paper is organized as follows. In Section 2, we introduce the definition of the profile-HMM and describe the details of the proposed clustering algorithm. The performance of the proposed method is evaluated using simulated data in Section 3, where we carry out a comparison study with other clustering techniques. We also apply the profile-HMM to real gene expression time-course data collected for the study of real biological problems. Section 4 analyzes and discusses the results obtained in these applications. Conclusions are provided in Section 5.

Section snippets

Hidden Markov models

An HMM is formally defined by the following elements (Rabiner, 1989):

• $S = \{s_{1}, \dots, s_{N}\}$, the finite set of possible (hidden) states.
• $A = [a_{ij}]_{1 \leq i, j \leq N}$, the transition matrix, which describes the transition probability distribution among the associated states. In this matrix, the entry $a_{ij}$ represents the probability of moving from state $s_{i}$ to state $s_{j}$: $a_{ij} = \Pr(q_{t+1} = s_{j} \mid q_{t} = s_{i})$, $1 \leq i, j \leq N$.
• $B = \{b_{j}(o), 1 \leq j \leq N\}$, the emission matrix, which indicates the probability of emission of observation $o \in \mathbb{R}$ when the system state is $s_{j}$ ($b_{j}(o) = \Pr(o \text{ at } \ldots)$)
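As a rough illustration of how these elements combine, the sketch below computes the log-likelihood of a short real-valued series with the scaled forward recursion from Rabiner (1989). Several pieces are assumptions made for the example: the emission densities $b_{j}(o)$ are taken to be Gaussian with hand-picked means, an initial state distribution `pi` is included (it is part of the standard HMM definition, although the excerpt is cut off before reaching it), and all numerical values are arbitrary.

```python
import numpy as np

def forward_loglik(obs, pi, A, means, sigma):
    """Return log Pr(obs | model) using the scaled forward recursion."""
    def emission(o):
        # Assumed Gaussian emission density b_j(o) for every state j.
        return np.exp(-0.5 * ((o - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    alpha = pi * emission(obs[0])
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = (alpha @ A) * emission(obs[t])
        scale = alpha.sum()
        loglik += np.log(scale)   # accumulate the log of each scaling factor
        alpha = alpha / scale     # rescale to avoid numerical underflow
    return loglik

# Hypothetical two-state model: S = {s_1, s_2}.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])        # each row of A sums to 1
means = np.array([-1.0, 1.0])
sigma = 0.5

ll_plausible = forward_loglik(np.array([-1.0, -1.0, 1.0, 1.0]), pi, A, means, sigma)
ll_implausible = forward_loglik(np.array([3.0, -3.0, 3.0, -3.0]), pi, A, means, sigma)
print(ll_plausible > ll_implausible)
```

A series that stays near the state means and switches state rarely receives a higher log-likelihood than one that jumps far from both means at every step.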

Simulation data

In order to investigate the performance of the profile-HMM method, and to compare it with that of other clustering techniques, we consider first a simple simulation-based example for which the correct results are known. The simulation dataset is a collection of 550 time-series, each of length $T = 5$ , generated from the five basic sequences shown in Fig. 2. In order to produce similar but distinct time-series (grouped in as many clusters as basic sequences), a continuous noise generated using
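A dataset with the same shape as the one described (550 series of length $T = 5$ derived from five basic sequences) could be generated along the following lines. Everything beyond those three numbers is an assumption: the basic sequences below are invented stand-ins for those in Fig. 2, the clusters are assumed to be of equal size, and Gaussian noise with an arbitrary scale is used because the excerpt is cut off before naming the noise model.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 5

# Hypothetical stand-ins for the five basic sequences of Fig. 2 (not the paper's values).
basic = np.array([
    [0, 1, 2, 3, 4],
    [4, 3, 2, 1, 0],
    [0, 2, 4, 2, 0],
    [4, 2, 0, 2, 4],
    [2, 2, 2, 2, 2],
], dtype=float)

per_cluster = 550 // len(basic)   # 110 series per cluster, assuming equal sizes
labels = np.repeat(np.arange(len(basic)), per_cluster)

# Assumed Gaussian noise with an arbitrary standard deviation of 0.3.
data = basic[labels] + rng.normal(scale=0.3, size=(labels.size, T))

print(data.shape)
```

Keeping `labels` alongside `data` gives the known ground truth against which any clustering result can be scored.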

Sporulation dataset

This dataset was collected to study the transcriptional program of sporulation in budding yeast (Chu et al., 1998). In this study, DNA microarrays, containing 97% of the known or predicted genes of Saccharomyces cerevisiae, are used to explore the temporal program of gene expression during meiosis and spore formation. Changes in the concentrations of mRNA transcripts from each gene were measured at seven uneven time intervals, and the sampling times were chosen based on the expression patterns

Conclusion and future work

This paper proposes a hidden Markov model-based clustering algorithm, which is able to efficiently account for the time dependences existing in time-course gene expression data. The proposed model, the profile-HMM, provides a natural description for multiple dependent stochastic patterns, where each pattern is represented as a path in the proposed model. This method shows a self-organizing nature in both its performance and its structure, and is able to automatically decide the number of

References (22)

Ji, X., et al., 2003. Mining gene expression data using a novel approach based on hidden Markov models. FEBS Lett.
Bar-Joseph, Z., Gerber, G., Gifford, D.K., Jaakkola, T.S., 2002. A new approach to analyzing gene expression time...
Bicego, M., Murino, V., Figueiredo, M., 2003. Similarity-based clustering of sequences using hidden Markov models. MLDM...
Chu, S., et al., 1998. The transcriptional program of sporulation in budding yeast. Science.
Davies, D.L., et al., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell.
Eisen, M.B., et al., 1998. Clustering analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA.
Fraley, C., Raftery, A., How many clusters? Which clustering method? Answers via model-based cluster analysis....
Gordon, A., 1999. Classification.
Halkidi, M., et al., 2001. On clustering validation techniques. J. Intell. Information Syst.
Iyer, V.R., et al., 1999. The transcriptional program in the response of human fibroblasts to serum. Science.
Jain, A.K., et al., 1988. Algorithms for Clustering Data.

