A data mining approach to discover unusual folding regions in genome sequences
Introduction
Complete genomic sequence data are being accumulated at an unprecedented pace. A wide variety of computational methods for analyzing genomic sequences have been developed [1], [2]. Most of the problems addressed by these methods are essentially statistical. Computational analysis of distinct sequence patterns can help us understand the structure and function of genomic sequences. The discovery of biological knowledge from sequence data consisting of the bases A, C, G and T/U in biological databases, such as GenBank, is especially important in the post-genomic age.
RNA is a single-stranded, conformationally polymorphic macromolecule whose nucleotide sequence is identical to that of one of the DNA strands except that T is replaced by U. An RNA sequence often folds back on itself between complementary segments to form various local structures guided by Watson–Crick rules. In addition to the Watson–Crick A–U and G–C base pairs, wobble G–U base pairs also contribute to the thermodynamic stability of an RNA structure. It has been demonstrated that some structures folded by local RNA segments are functional elements that control gene regulation at different levels [3], [4]. These functional elements are often closely associated with unusual folding regions (UFRs), where the folding free energy is significantly lower than that expected by chance [5], [6], [7], [8], [9], [10], [11], [12]. The development of an efficient data mining approach to extract these potentially functional structured elements from sequence databases is highly desirable.
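The pairing rules described above can be written down in a few lines. The following is a minimal Python sketch for illustration only; the names `dna_to_rna` and `can_pair` are hypothetical and do not come from the programs used in this study.

```python
# Allowed RNA base pairs: Watson-Crick (A-U, G-C) plus the wobble pair G-U.
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE_PAIRS = {("G", "U"), ("U", "G")}


def dna_to_rna(seq: str) -> str:
    """Return the RNA form of a DNA strand: same bases, with T replaced by U."""
    return seq.upper().replace("T", "U")


def can_pair(a: str, b: str) -> bool:
    """True if bases a and b can form a Watson-Crick or G-U wobble pair."""
    pair = (a.upper(), b.upper())
    return pair in CANONICAL_PAIRS or pair in WOBBLE_PAIRS
```

For example, `can_pair("G", "U")` is true because of the wobble rule, while `can_pair("A", "C")` is false.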
Knowledge discovery of functional structured elements in a genomic sequence is an important step toward our goal of turning genome data into biological knowledge. The thermodynamic stability of an RNA/DNA fragment in the genome is often measured by the free energy of formation of the folded RNA/DNA segment. Based on accumulated data [3], [4], [13], UFRs in an RNA sequence are assessed by two z-scores, the significance score (SIGSCR) and the stability score (STBSCR) [13], [14]. SIGSCR measures the difference in thermodynamic stability between a local, natural RNA fragment and the average of its randomly shuffled sequences. Similarly, STBSCR measures the difference in stability between a specific fragment at a given position and the average of all other fragments of the same size in the sequence. As an example of our data mining, we analyze the complete genome sequence of Mycoplasma genitalium (M. gen).
Our data mining approach consists of three steps. In the first step, we compute SIGSCR and STBSCR by sliding a fixed window with a step of one base along the sequence from the start to the end position. Our statistical analysis shows that the distributions of the two z-scores in the sequence do not follow a simple normal distribution. To obtain useful information from the extraordinarily large number of sample observations in the analysis, we have to derive a reliable statistical model to describe the distributions of the two z-scores. In the second step, we develop a linearly transformed non-central Student's t statistical model to delineate the distributions of SIGSCR and STBSCR in the entire genomic sequence by means of non-central Student's t distribution theory [15]. Statistical tests show that the linearly transformed non-central Student's t distribution (LTNSTD) is a good statistical model for the distributions of the two scores computed in the genome. In the last step, the significant UFRs that are either much more stable or much less stable than expected by chance are discovered based on the derived, well-fitted LTNSTD.
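The window scan in the first step can be sketched as follows. This is a minimal Python illustration of sliding a fixed window one base at a time, not the SIGSTB program itself; the function name `sliding_windows` and its parameters are hypothetical.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) for each fixed-size window along seq.

    start is a 0-based position; with step=1 the window moves one base at a time.
    A sequence shorter than the window yields nothing.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]
```

With a step of one base, a sequence of length N gives N − w + 1 fragments for a window of w bases, so nearly every position in a genome contributes one observation of each score.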
For comparison, we also compute the distributions of SIGSCR and STBSCR in a randomly shuffled sequence of the complete M. gen genome. Our results further demonstrate that the statistical extremes of UFRs in M. gen do not arise by chance. The UFRs in the genome may indicate biological functions of the primary sequence data and provide useful information for further searches for functional structured elements involved in the control of regulatory genes [5], [6], [7], [8], [9], [10], [11], [12].
Section snippets
SIGSCR and STBSCR of a folding segment
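The quality of a local structure in a DNA/RNA sequence is often evaluated by the thermodynamic stability of the structured segment. The more negative the free energy of formation of the structure, the more stable the folded structure of the fragment. In this study, the biological information of such structured fragments in an RNA sequence is evaluated by the SIGSCR and STBSCR of a local segment. Both are standard z-scores, given by SIGSCR = (E − Er)/stdr and STBSCR = (E − Ew)/stdw, where E is the lowest free energy of the segment, Er and stdr are the mean and standard deviation of the lowest free energies of its randomly shuffled sequences, and Ew and stdw are the mean and standard deviation of the lowest free energies of all other fragments of the same size in the sequence.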
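As a rough illustration of how the two z-scores are formed, the sketch below computes each as (value − mean) / standard deviation against the appropriate reference set. It is a Python sketch under stated assumptions, not the authors' implementation: the `energy` callable is supplied by the caller and stands in for a real folding routine, the number of shuffles and the use of the sample standard deviation are assumptions, and all function names are hypothetical.

```python
import random
import statistics
from typing import Callable, Optional, Sequence


def z_score(value: float, reference: Sequence[float]) -> float:
    """(value - mean) / sample standard deviation of the reference energies."""
    mean = statistics.fmean(reference)
    std = statistics.stdev(reference)  # raises ZeroDivisionError below if 0
    return (value - mean) / std


def sigscr(segment: str, energy: Callable[[str], float],
           n_shuffles: int = 100, rng: Optional[random.Random] = None) -> float:
    """Compare a segment's energy with energies of its shuffled versions."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    bases = list(segment)
    shuffled_energies = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # same base composition, random order
        shuffled_energies.append(energy("".join(bases)))
    return z_score(energy(segment), shuffled_energies)


def stbscr(window_energies: Sequence[float], index: int) -> float:
    """Compare one window's energy with all other same-size windows."""
    others = [e for i, e in enumerate(window_energies) if i != index]
    return z_score(window_energies[index], others)
```

A strongly negative score in either case marks a fragment that is more stable than its reference set.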
First step of our data mining: computing SIGSCR and STBSCR
In the first step of our data mining approach, SIGSCR and STBSCR in a sequence are computed by the program SIGSTB, a modified version of SEGFOLD [14], using fixed windows of 100, 300 and 500 bases. The program SIGSTB first computes E, Er, stdr and SIGSCR for the fragment with the same size as the selected window from the beginning of the sequence. The lowest free energy E is computed by folding the segment using the dynamic programming algorithm [17] and Turner energy rules [18]. Er and stdr of
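To give a feel for the dynamic programming involved, here is a deliberately simplified Python sketch. It is not the algorithm of [17] nor the Turner energy rules [18]: it uses Nussinov-style base-pair maximization, a much cruder stand-in in which every allowed pair counts equally. The function name `max_base_pairs` and the minimum hairpin loop size of 3 are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    best[i][j] holds the optimum for the subsequence seq[i..j]; paired
    bases must be separated by at least min_loop unpaired bases.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

A toy stability value such as `-max_base_pairs(fragment)` can be plugged in wherever a folding energy is needed for experimentation, but it should not be mistaken for a free energy in kcal/mol.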
Second step of our data mining: deriving a LTNSTD for SIGSCR and STBSCR
Since neighboring scores in the six samples are possibly not fully independent, we also take a random sample of 5000 observations (SIGSCR or STBSCR) from each of the six samples such that the distance between any two neighboring observations in the randomly selected sample is at least 100 bases. We compute the sample mean (ȳ), sample standard deviation (s_y) and sample coefficient of skewness (k) of SIGSCR and STBSCR in these randomly selected samples. For a given
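One way to draw such a spaced random sample and summarize it is sketched below in Python. The sampling trick (draw distinct positions from a shortened range, then spread them out) and the moment-based skewness formula m3 / m2^1.5 are assumptions chosen for illustration; the text does not state which procedure or skewness estimator was used, and the function names are hypothetical.

```python
import random
import statistics
from typing import List, Optional, Sequence, Tuple


def spaced_sample_positions(n_positions: int, sample_size: int, min_gap: int,
                            rng: Optional[random.Random] = None) -> List[int]:
    """Pick sorted random positions in [0, n_positions) at least min_gap apart."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    slack = n_positions - (sample_size - 1) * (min_gap - 1)
    if sample_size < 1 or min_gap < 1 or slack < sample_size:
        raise ValueError("cannot place that many positions with the requested gap")
    # Distinct positions in a shortened range differ by >= 1; adding
    # i * (min_gap - 1) to the i-th one stretches every gap to >= min_gap.
    compact = sorted(rng.sample(range(slack), sample_size))
    return [y + i * (min_gap - 1) for i, y in enumerate(compact)]


def summarize(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (sample mean, sample standard deviation, moment-based skewness)."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return mean, std, m3 / m2 ** 1.5
```

The scores at the selected positions would then be passed to `summarize`; a skewness far from zero is the kind of asymmetry that rules out a simple normal model.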
Third step of our data mining: discoveries of UFRs
For a continuous distribution of a random variable x, we define the quantile [24] q_α with probability α as the value satisfying P(x ≤ q_α) = α. For a given probability α in the derived theoretical cumulative distribution F(x; f, δ) of the LTNSTD, we calculate the quantile q_α by solving F(q_α; f, δ) = α, that is, q_α = F⁻¹(α; f, δ), where F⁻¹ is the inverse function of F. In practice, q_α is computed by the function NCTINV in the Statistics Toolbox of MATLAB. In general, we calculate
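The idea of obtaining a quantile by inverting a cumulative distribution can be illustrated numerically. The Python sketch below solves F(q) = α by bisection for any continuous, increasing CDF. Because the Python standard library has no non-central t distribution, the example uses the standard normal CDF purely as a stand-in; the function name `quantile_by_bisection` and the bracketing interval are assumptions. For the non-central t case itself, SciPy's `scipy.stats.nct.ppf` plays a role comparable to NCTINV.

```python
from statistics import NormalDist
from typing import Callable


def quantile_by_bisection(cdf: Callable[[float], float], alpha: float,
                          lo: float, hi: float, tol: float = 1e-10) -> float:
    """Solve cdf(q) = alpha for q, assuming cdf is continuous and increasing
    and that the interval [lo, hi] brackets the solution."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not cdf(lo) <= alpha <= cdf(hi):
        raise ValueError("[lo, hi] does not bracket the requested quantile")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid  # quantile lies to the right of mid
        else:
            hi = mid  # quantile lies at or to the left of mid
    return 0.5 * (lo + hi)


# Stand-in example: upper 2.5% cutoff of the standard normal distribution.
q_upper = quantile_by_bisection(NormalDist().cdf, 0.975, -10.0, 10.0)
```

Scores beyond a lower quantile such as q_0.01 or an upper quantile such as q_0.99 of the fitted distribution would then be flagged as statistical extremes; the specific α values here are only examples.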
Statistics of SIGSCR and STBSCR in the M. gen genome
Statistics of local thermodynamic stability in the M. gen sequence are listed in Table 1. The distributions of SIGSCR and STBSCR computed with windows of 100, 300 and 500 bases are clearly asymmetric in the M. gen sequence. These distributions do not follow a normal distribution because of the large skewness in the samples (see Fig. 1). We also computed the means of SIGSCR and STBSCR in the protein-coding, RNA gene and non-coding regions by means of the known gene structures of M. gen listed
Conclusions and perspectives
In this study, we present a data mining approach to discover UFRs in the M. gen genome sequence. In the first step of the approach, we calculate the two z-scores, SIGSCR and STBSCR, along the sequence. Next, we derive an LTNSTD statistical model to describe the distributions of the two scores in the M. gen sequence. Finally, based on the derived LTNSTDs, we discover the UFRs in M. gen whose SIGSCR and STBSCR values deviate significantly from their sample means. The approach is generally
Acknowledgements
The contents of this publication do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. The program SEGFOLD and its modified version SIGSTB are available via anonymous ftp as /home/ftp/pub/users/shuyun/sigfold at ftp.ncifcrf.gov. The script files for performing the statistical analyses in this study are available upon request from the
References (24)
- et al. A method for assessing the statistical significance of RNA folding. J. Theor. Biol. (1989)
- et al. Local thermodynamic stability scores are well represented by a non-central Student's t distribution. J. Theor. Biol. (2001)
- et al. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. (1998)
- et al. Biological Sequence Analysis. (1998)
- The RNA World.
- RNA Structure and Function.
- et al. The HIV-1 rev trans-activator acts through a structured target sequence to activate nuclear export of unspliced viral mRNA. Nature (1989)
- et al. conserved RNA folding region coincident with the Rev response element of primate immunodeficiency viruses. Nucl. Acids Res. (1990)
- et al. Stability of RNA stem-loop structure and distribution of non-random structure in the human immunodeficiency virus (HIV-1). Nucl. Acids Res. (1988)
- et al. Thermodynamic stability and statistical significance of potential stem-loop structures situated at the frameshift sites of retro-viruses. Nucl. Acids Res. (1989)
- Identification of unusual RNA folding patterns encoded by bacteriophage T4 gene 60. Gene
- A common structural core in the internal ribosome entry sites of picornavirus, hepatitis C virus, and pestivirus. Virus Gene