An expert system to identify co-regulated gene groups from time-lagged gene clusters using cell cycle expression data

doi:10.1016/j.eswa.2009.07.053

Expert Systems with Applications

Volume 37, Issue 3, 15 March 2010, Pages 2202-2213

https://doi.org/10.1016/j.eswa.2009.07.053

Abstract

Motivation

The analysis of time series gene expression data gives us the opportunity to find co-regulated genes that show similar expression patterns under a contiguous subset of experimental conditions. However, these co-regulated genes may behave almost independently under other conditions. Furthermore, the similarity in the expression pattern might be time-shifted. In that case, we need to group genes that share similar expression patterns under a contiguous subset of conditions while allowing those patterns to be separated by time lags. In addition to time-shifted similarity, because co-regulated genes in a biological process may show a periodic pattern in their cell cycle expression, we should also group genes with periodically similar patterns over multiple cell cycles. Genes grouped in this way can be regarded as cell-cycle regulated genes.

Results

We propose a method that follows the q-cluster concept [Ji, L., & Tan, K. L. (2005). Identifying time-lagged gene clusters using gene expression data. Bioinformatics, 21(4), 509–516] and further advances this approach towards the identification of cell-cycle regulated genes using cell cycle microarray data. We used our method to cluster a microarray time series of yeast genes. To assess the biological significance of the obtained clusters statistically, we used the p-value obtained from the hypergeometric distribution. We found that several clusters provided findings suggesting a transcription factor (TF)–target relationship. In order to test whether our method could group related genes that other methods have found difficult to group, we compared our method with other measures such as Spearman Rank Correlation and Pearson Correlation. The results of the comparison demonstrate that our method could indeed group known related genes that these measures regard as having only a weak association.
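The hypergeometric p-value mentioned above can be illustrated with a minimal sketch. It assumes the usual enrichment setting: a population of genes, some of which carry a given annotation, and a cluster that is treated as a random draw without replacement. The function name, parameter names and the numbers in the usage line are hypothetical and are not taken from the article.

```python
from math import comb


def hypergeom_pvalue(population, annotated, cluster_size, hits):
    """Upper-tail probability P(X >= hits) of the hypergeometric distribution.

    population   -- total number of genes considered
    annotated    -- genes in the population that carry the annotation
    cluster_size -- number of genes in the cluster
    hits         -- annotated genes observed inside the cluster
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers, for illustration only.
p = hypergeom_pvalue(population=6000, annotated=100, cluster_size=20, hits=5)
print(f"p-value: {p:.3g}")
```

A small p-value indicates that the observed overlap between the cluster and the annotated set is unlikely under random sampling.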

Introduction

DNA microarray technology enables the simultaneous study of gene expression levels on a large scale. Expression level is the logarithm of the abundance of the mRNA of a gene under a specific condition. The gene expression data of a microarray is arranged as a data matrix. Each gene corresponds to one row and each condition to one column. Each element of this matrix represents the expression level of a gene under a specific condition. The conditions of a microarray may be different time points, different environmental conditions or different organs. The analysis of microarray data has facilitated the study of genetic regulatory networks. The correlation patterns of genes across experimental conditions can be used to identify networks composed of correlated genes and thus to infer how those genes interact with each other.

Clustering methods can be applied to either the genes or the conditions of the microarray matrix separately. However, some problems occur when applying clustering to the analysis of gene expression data. A set of genes may simultaneously activate a particular biological process over certain contiguous conditions but behave independently under other conditions. Therefore, we need to group genes that have similar behavior under a specific subset of the conditions. Conventional clustering cannot satisfy this requirement. Biclustering is a technique that makes this possible by grouping genes and conditions simultaneously within a data matrix. The goal of biclustering is to find a bicluster, that is, a subset of genes that show similar behavior under a specific subset of the conditions (Cheng & Church, 2000). Thus, genes in the same bicluster are co-expressed and are therefore likely to be co-regulated.

In the analysis of time-series expression data, a set of genes may activate a particular biological process over a contiguous set of conditions rather than under isolated conditions. In such a case, we should find biclusters for a contiguous subset of conditions. In fact, however, co-expressed genes do not regulate each other simultaneously but only after a certain time lag. That is to say, there is a transcriptional time lag: the regulator gene takes time to express its protein product, and a further delay occurs as the target gene responds to the regulator protein. Hence, we need to take time-lagged co-regulated genes into consideration when forming biclusters. In addition, when considering time-lagged similarity of expression patterns between genes, it is necessary to consider biclusters with coherent values for both the rows and columns of the expression matrix. This is because gene expression relies on a promoter, a structural regulatory sequence recognized by a TF of the RNA polymerase holoenzyme. Co-expressed genes share a common sequence within their promoters, which results in shared expression. However, the recognition efficiency of this TF is not the same for every gene having the same promoter. This leads to biclusters having a variety of coherent values. There are two kinds of coherent values for a bicluster. The first is a shifted similarity pattern, which can be viewed as based on an additive model. The second is a scaled similarity pattern, which can be viewed as based on a multiplicative model. In a mathematical sense, scaling and vertical shifting of the expression level are linear transformations. Consider two time series x and y: y is a linear transformation of x if it can be expressed as $y = mx + b$.
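To make the combination of a time lag with the linear model $y = mx + b$ concrete, the following minimal sketch fits $y[t + lag] = m x[t] + b$ by ordinary least squares over the overlapping time points and reports the largest deviation from the fit. It is an illustration under stated assumptions, not the method of the article: the function name, the use of least squares, the non-negative integer lag and the toy series are all hypothetical.

```python
def fit_lagged_linear(x, y, lag):
    """Least-squares fit of y[t + lag] = m * x[t] + b (lag is a non-negative integer).

    Returns (m, b, max_residual), where max_residual is the largest absolute
    deviation of the lagged y values from the fitted line.
    """
    n = min(len(x), len(y) - lag)
    if n < 2:
        raise ValueError("need at least two overlapping time points")
    xs = x[:n]
    ys = y[lag:lag + n]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((u - mean_x) ** 2 for u in xs)
    if sxx == 0:
        raise ValueError("x is constant over the overlapping window")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys))
    m = sxy / sxx                      # scaling factor
    b = mean_y - m * mean_x            # vertical shift
    max_residual = max(abs(v - (m * u + b)) for u, v in zip(xs, ys))
    return m, b, max_residual


# Toy series: y repeats x one time point later, scaled by 2 and shifted by 1.
x = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0]
y = [9.0, 1.0, 3.0, 5.0, 3.0, 1.0, 3.0, 5.0]
print(fit_lagged_linear(x, y, lag=1))
print(fit_lagged_linear(x, y, lag=0))
```

In the toy data the fit at lag 1 is exact, whereas at lag 0 the residual is large, which is the situation in which a measure that ignores time lags would report only a weak association.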

Many biological processes, such as the cell cycle, show periodic patterns. It is therefore useful to find periodically regulated genes with similar periodic patterns of expression. We can then use these cell-cycle regulated genes to map the transcriptional regulatory network that controls the cell cycle.

A suffix tree is a data structure that contains all suffixes of a string s. It has been widely used for string matching and exact sequence comparison, and it can be built in $O(n)$ time, where n is the length of s (Ukkonen, 1995). Once a suffix tree is built, many string-matching problems can be solved in linear time using it. The suffix tree built for a set of strings is called a generalized suffix tree (Gusfield, 1997). To ensure that no suffix is a prefix of another, so that every suffix ends at its own leaf, we usually append an extra terminator character $ to s before building the suffix tree. The key feature of the generalized suffix tree is that every leaf contains two pieces of information: the first is the string number, and the second is the starting position of the suffix within that string.
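The leaf labels described above can be illustrated with a deliberately naive sketch. It builds an uncompressed generalized suffix trie by inserting every suffix character by character, which takes quadratic time and space and is not the linear-time construction of Ukkonen (1995). The class name, function names and the two short example strings are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # maps a character to the child node
        self.labels = []    # (string number, start position) pairs, set at leaves


def build_generalized_suffix_trie(strings, terminator="$"):
    """Insert every suffix of every string, each ending with the terminator."""
    root = Node()
    for number, s in enumerate(strings):
        if terminator in s:
            raise ValueError("terminator must not occur inside a string")
        text = s + terminator
        for start in range(len(s)):
            node = root
            for ch in text[start:]:
                if ch not in node.children:
                    node.children[ch] = Node()
                node = node.children[ch]
            # The node reached after the terminator is a leaf.
            node.labels.append((number, start))
    return root


def occurrences(root, pattern):
    """Return all (string number, start position) pairs where pattern occurs."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:  # collect the labels of every leaf below the matched node
        current = stack.pop()
        found.extend(current.labels)
        stack.extend(current.children.values())
    return sorted(found)


trie = build_generalized_suffix_trie(["UDUD", "DUD"])
print(occurrences(trie, "UD"))
```

Every occurrence of a pattern in any input string corresponds to a leaf below the node reached by spelling out the pattern, so collecting the leaf labels recovers both the string number and the starting position.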

Section snippets

Related work

A great many approaches have been developed for the identification of co-regulated genes from microarrays. The correlation method determines whether two variables have a strong global association, but it does not take time lags into consideration. Another correlation method, the Cross-Correlation Method (Kato et al., 2001), differs from the traditional Pearson Correlation approach. It takes into account time-lagged co-regulations when testing the

Methods

In order to take into account the bicluster’s coherent values when biclustering, we first transform the expression data into event strings. We then use the set of event strings to construct a generalized suffix tree. Once the generalized suffix tree has been built, we can use it to form biclusters quickly while taking time lags into consideration. Furthermore, during the analysis of cell cycle expression data, we need to focus on finding cell-cycle regulated genes. Therefore, we further transform the event

Clustering with the aim of finding time-lagged co-regulated genes

In this experimental system, we used Spellman’s yeast cell cycle dataset, which includes 6331 open reading frames. The full dataset contains all the expression data for the alpha factor, cdc15 and elutriation time course experiments. We used only the alpha factor dataset to validate our approach. We examined the results obtained with this dataset by our method for the detection of time-lagged co-regulated genes over the cell cycle. We applied the approach to the cell-cycle regulated genes that

Comparison of other approaches

In order to test whether our method could group co-regulated genes that other methods find hard to group, we compared our method with correlation measures such as Spearman Rank Correlation and Pearson Correlation. Initially, we picked two genes, MCM7 and MCM4, which are components of the MCM complex (Davey et al., 2003). Fig. 6 shows the expression levels of these two genes and the similar expression patterns that were detected by our method. Then, we calculated the pairwise similarity between

Conclusion

By converting the gene expression values, we present a local method for finding potential TF–target relationships; it allows the detection of time-lagged gene clusters based on periodically similar patterns over multiple cycles. Genes that have the same periodic patterns are grouped together, and these genes are likely to be cell-cycle regulated genes that are controlled simultaneously by a TF. Generally, these TFs are included in the same group as the regulated genes. The similarity of the two expression

References (16)

Cho, R. J., et al. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell.
Davey, M., et al. (2003). Reconstitution of the Mcm2-7p heterohexamer, subunit arrangement, and ATP site architecture. Journal of Biological Chemistry.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules, Proc. of the 20th VLDB Conf.,...
Balasubramaniyan, R., et al. (2005). Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics.
Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. In Proceedings of the International Conference on...
Filkov, V., Skiena, S., et al. (2001). Identifying gene regulatory networks from experimental data. In Proceedings of...
Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology.
Ji, L., & Tan, K. L. (2005). Identifying time-lagged gene clusters using gene expression data. Bioinformatics, 21(4), 509–516.

There are more references available in the full text version of this article.
