A data mining approach to discover unusual folding regions in genome sequences

https://doi.org/10.1016/S0950-7051(01)00146-0

Abstract

Numerous experiments and analyses of RNA structures have revealed that distinct local structure correlates closely with biological function. In this study, we present a data mining approach to discover such unusual folding regions (UFRs) in genome sequences. Our approach is a three-step procedure. In the first step, the extent to which a local structure in a genomic sequence differs from a random folding is evaluated by two z-scores of the local segment, the significance score (SIGSCR) and the stability score (STBSCR). The two scores are computed by sliding a fixed window along the sequence, one base at a time, from the start to the end position. Next, based on non-central Student's t distribution theory, we derive a linearly transformed non-central Student's t distribution (LTNSTD) to describe the distribution of SIGSCR and STBSCR computed in the sequence. In the third step, we extract from the sequence the significant UFRs whose SIGSCR and/or STBSCR are greater or less than a given threshold calculated from the derived LTNSTD. Our data mining approach is successfully applied to the complete genome of Mycoplasma genitalium (M. gen) and discovers these statistical extremes in the genome. By comparison with the two scores computed from randomly shuffled sequences of the entire M. gen genome, our results demonstrate that the UFRs in the M. gen sequence are not selected by chance. These UFRs may indicate an important structural role encoded in their sequence information.

Introduction

Complete genomic sequence data are being accumulated at an unprecedented pace. A wide variety of computational methods for analyzing genomic sequences have been developed [1], [2]. Most of the problems in these methods are essentially statistical. Computational analyses of the distinct sequence pattern can help to understand the structure and function of genomic sequences. The discovery of biological knowledge from sequence data consisting of bases A, C, G, T/U in biological databases, such as Genbank, is especially important in a post-genomics age.

RNA is a single-stranded, conformationally polymorphic macromolecule whose nucleotide sequence is identical to that of one of the DNA strands except for the replacement of T by U. The RNA sequence often folds back on itself between complementary segments to form various local structures guided by Watson–Crick rules. In addition to the Watson–Crick A–U and G–C base pairs, wobble G–U base pairs also contribute to the thermodynamic stability of an RNA structure. It has been demonstrated that some structures folded by local RNA segments are functional elements that control gene regulation at different levels [3], [4]. These functional elements are often closely associated with unusual folding regions (UFRs), where the folding free energy of the UFR is significantly lower than that expected by chance [5], [6], [7], [8], [9], [10], [11], [12]. The development of an efficient data mining approach to extract these potentially functional structured elements from the sequence database is highly desirable.

Knowledge discovery of functional structured elements in a genomic sequence is an important step toward our goal of turning genome data into biological knowledge. The thermodynamic stability of an RNA/DNA fragment in the genome is often measured by the free energy of formation of the folded RNA/DNA segment. Based on accumulated data [3], [4], [13], UFRs in an RNA sequence are assessed by two z-scores, the significance score (SIGSCR) and the stability score (STBSCR) [13], [14]. SIGSCR signifies the difference in thermodynamic stability between a local, natural RNA fragment and the average of its randomly shuffled sequences. Similarly, STBSCR indicates the difference in stability between a specific fragment at a given place and the average over all other fragments of the same size in the sequence. As an example of our data mining, we analyze the complete genome sequence of Mycoplasma genitalium (M. gen).
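The two scores described above can be illustrated with a minimal sketch, assuming the folding free energies have already been computed elsewhere. The function names `z_score`, `sigscr` and `stbscr` are hypothetical, and the use of the population standard deviation is an assumption; the estimator actually used by the authors is not stated in this text.

```python
from statistics import mean, pstdev


def z_score(value, reference):
    """Standard z-score of `value` against a list of reference values."""
    mu = mean(reference)
    # Assumption: population standard deviation; the source does not say
    # which estimator is used.
    sigma = pstdev(reference)
    if sigma == 0:
        raise ValueError("reference values have zero spread")
    return (value - mu) / sigma


def sigscr(e_native, shuffled_energies):
    """Native fragment energy versus energies of its shuffled versions."""
    return z_score(e_native, shuffled_energies)


def stbscr(e_native, other_window_energies):
    """Native fragment energy versus energies of the other same-size fragments."""
    return z_score(e_native, other_window_energies)


# Made-up energies (kcal/mol), purely for illustration.
print(sigscr(-35.0, [-20.0, -30.0]))
```

A strongly negative score marks a fragment that is more stable than its reference set, and a strongly positive score marks one that is less stable.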

Our data mining approach consists of three steps. In the first step, we compute SIGSCR and STBSCR by sliding a fixed window with a step of one base along the sequence from the start to the end position. Our statistical analysis shows that the distributions of the two z-scores in the sequence do not follow a simple normal distribution. In order to obtain useful information from an extraordinarily large number of sample observations in the analysis, we have to derive a reliable statistical model to describe the distributions of the two z-scores. In the second step, we develop a linearly transformed non-central Student's t statistical model to delineate the distributions of SIGSCR and STBSCR in the entire genomic sequence by means of non-central Student's t distribution theory [15]. Statistical tests show that the linearly transformed non-central Student's t distribution (LTNSTD) is a good statistical model to describe the distributions of the two scores computed in the genome. In the last step, the significant UFRs that are either much more stable or much less stable than expected by chance are discovered based on the derived, well-fitted LTNSTD.
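The window scan in the first step might look roughly like the sketch below. It is not the SIGSTB program: `toy_energy` is a hypothetical stand-in that only counts base pairs in a single fold-back hairpin, whereas the study computes the lowest free energy with a dynamic programming folding algorithm. The names `sliding_windows`, `toy_energy` and `scan` are illustrative, and an RNA alphabet (A, C, G, U) is assumed.

```python
# Watson-Crick and wobble pairs, as listed in the introduction.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def sliding_windows(sequence, window, step=1):
    """Yield (start, fragment) for every full-length window along the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def toy_energy(fragment):
    """Hypothetical stand-in for a folding energy, NOT a thermodynamic model.

    Pairs position i with its mirror position from the other end and
    returns the negated number of allowed pairs, so that more pairs
    give a lower (more negative) value.
    """
    half = len(fragment) // 2
    pairs = sum((fragment[i], fragment[-1 - i]) in PAIRS for i in range(half))
    return -float(pairs)


def scan(sequence, window, energy_fn=toy_energy, step=1):
    """Return a list of (start, energy) for each window position."""
    return [(start, energy_fn(frag))
            for start, frag in sliding_windows(sequence, window, step)]


for start, energy in scan("GGGAAACCCAUAU", window=9):
    print(start, energy)
```

Passing the energy function as a parameter keeps the scan independent of the folding model, so a real free-energy routine could be substituted without changing the loop.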

As a comparison, we also compute the distributions of SIGSCR and STBSCR in the randomly shuffled sequence of the complete M. gen genome. Our results further demonstrate that the statistical extremes of UFRs are not selected by chance in M. gen. The UFRs in the genome may imply the biological functions of the primary sequence data and provide useful information in further searching for functional structured elements involved in the control of regulatory genes [5], [6], [7], [8], [9], [10], [11], [12].
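The shuffled-genome comparison can be sketched as follows, assuming a simple permutation that preserves base composition (the source does not specify the shuffling scheme). `shuffle_sequence` and `tail_fraction` are hypothetical names, and the score lists would come from scanning the native and the shuffled sequence with the same window.

```python
import random


def shuffle_sequence(sequence, seed=None):
    """Return a random permutation of the sequence (same base composition)."""
    rng = random.Random(seed)
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def tail_fraction(scores, low, high):
    """Fraction of scores that fall below `low` or above `high`."""
    if not scores:
        raise ValueError("scores must not be empty")
    extreme = sum(1 for s in scores if s < low or s > high)
    return extreme / len(scores)


# Made-up scores, purely for illustration.
native_scores = [-5.0, -1.0, 0.0, 1.0, 4.0]
shuffled_scores = [-1.5, -0.5, 0.0, 0.5, 1.5]
print(tail_fraction(native_scores, -3.0, 3.0))
print(tail_fraction(shuffled_scores, -3.0, 3.0))
```

If the native sequence shows a clearly larger tail fraction than its shuffled counterpart at the same thresholds, that is consistent with the extremes not arising by chance.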

Section snippets

SIGSCR and STBSCR of a folding segment

The quality of a local structure in a DNA/RNA sequence is often evaluated by the thermodynamic stability of the structured segment. The more negative the free energy of formation of the structure, the more stable the folded structure of a fragment. In this study, the biological information of such structured fragments in an RNA sequence is evaluated by SIGSCR and STBSCR of a local segment. SIGSCR and STBSCR are standard z-scores, given by SIGSCR = (E − Er)/stdr and STBSCR = (E − Ew

First step of our data mining: computing SIGSCR and STBSCR

In the first step of our data mining approach, SIGSCR and STBSCR in a sequence are computed by the program SIGSTB, a modified version of SEGFOLD [14], using fixed windows of 100, 300 and 500 bases. The program SIGSTB first computes E, Er, stdr and SIGSCR for the fragment with the same size as the selected window from the beginning of the sequence. The lowest free energy E is computed by folding the segment using the dynamic programming algorithm [17] and Turner energy rules [18]. Er and stdr of

Second step of our data mining: deriving an LTNSTD for SIGSCR and STBSCR

Since neighboring scores in the six samples are possibly not fully independent, we also take a random sample with size of 5000 observations (SIGSCR or STBSCR) for each of the six samples so that the distance between any two neighboring observations in the randomly selected sample is equal to or larger than 100 bases. We compute the sample mean (ȳ), sample standard deviation (sy) and sample coefficient of skewness (k) for data SIGSCR and STBSCR in these randomly selected samples. For a given
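The spaced random sampling and the three sample statistics mentioned in this snippet can be sketched as below. This is an illustration under assumptions: the names `spaced_positions` and `sample_moments` are hypothetical, the spacing trick is one possible way to enforce the minimum distance, and the moment-based skewness formula is assumed because the authors' exact estimator is not shown here.

```python
import random
from math import sqrt


def spaced_positions(n, k, min_gap, seed=None):
    """Draw k sorted positions from range(n), neighbours at least min_gap apart."""
    slack = n - (k - 1) * (min_gap - 1)
    if k < 1 or min_gap < 1 or slack < k:
        raise ValueError("cannot place k positions with this spacing")
    rng = random.Random(seed)
    # Sample distinct values in a shortened range, then stretch the gaps.
    base = sorted(rng.sample(range(slack), k))
    return [b + i * (min_gap - 1) for i, b in enumerate(base)]


def sample_moments(values):
    """Return (mean, sample standard deviation, coefficient of skewness)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero spread")
    std = sqrt(m2 * n / (n - 1))  # n - 1 in the denominator
    skew = m3 / m2 ** 1.5         # assumed moment-based definition
    return mean, std, skew


positions = spaced_positions(n=580000, k=5000, min_gap=100, seed=0)
print(len(positions), min(b - a for a, b in zip(positions, positions[1:])))
```

The sequence length of 580000 in the usage line is only a placeholder of roughly genome scale, not a figure taken from the text.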

Third step of our data mining: discoveries of UFRs

For a continuous distribution of a random variable x, we define the quantile [24] qα with probability α in the distribution as P(x ≤ qα) = α. For a given probability α in the derived theoretical cumulative distribution F(x; f, δ) of the LTNSTD, we calculate the quantile qα by solving the equation qα = F⁻¹(x; f, δ), where F⁻¹(x; f, δ) is the inverse function of F(x; f, δ). In practice, qα is computed by the function NCTINV in the statistical toolbox of MATLAB software. In general, we calculate
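The quantile definition above amounts to inverting a cumulative distribution function. The sketch below does this numerically by bisection and then applies a linear transform. Two assumptions are made loudly: the transform is taken to be y = shift + scale · t with scale > 0 (the exact form of the LTNSTD is not given in this text), and a standard normal CDF is used only as a dependency-free stand-in. It is not the non-central t distribution the study uses; a non-central t CDF, for example from `scipy.stats.nct`, would take its place in practice. All function and parameter names are hypothetical.

```python
from statistics import NormalDist


def quantile(cdf, alpha, lo=-50.0, hi=50.0, tol=1e-10):
    """Solve cdf(q) = alpha by bisection for an increasing, continuous cdf."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not cdf(lo) <= alpha <= cdf(hi):
        raise ValueError("bracket [lo, hi] does not contain the quantile")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def transformed_quantile(cdf, alpha, shift, scale):
    """Quantile of y = shift + scale * t, where t has the given cdf."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return shift + scale * quantile(cdf, alpha)


# Stand-in CDF only: replace with a non-central t CDF for real use.
stand_in_cdf = NormalDist().cdf
lower = transformed_quantile(stand_in_cdf, 0.005, shift=0.0, scale=1.0)
upper = transformed_quantile(stand_in_cdf, 0.995, shift=0.0, scale=1.0)
print(lower, upper)
```

Because a positive linear transform preserves order, the quantile of the transformed variable is simply the transformed quantile of the underlying one, which is why the inversion only needs the untransformed CDF.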

Statistics of SIGSCR and STBSCR in the M. gen genome

Statistics of local thermodynamic stability in the M. gen sequence are listed in Table 1. It is clear that the distributions of SIGSCR and STBSCR computed with windows of 100, 300 and 500 bases are asymmetric in the M. gen sequence. These distributions do not follow a normal distribution because of large skewness in the samples (see Fig. 1). We also computed the means of SIGSCR and STBSCR in the protein coding, RNA gene and non-coding regions by means of the known gene structures of M. gen listed

Conclusions and perspectives

In this study, we present a data mining approach to discover UFRs in the M. gen genome sequence. In the first step of the approach, we calculate the two z-scores, SIGSCR and STBSCR, in the sequence. Next, we derive an LTNSTD statistical model to describe the distributions of the two scores in the M. gen sequence. Finally, based on the derived LTNSTDs, we discover the UFRs in M. gen whose SIGSCR and STBSCR values deviate significantly from their sample means. The approach is generally

Acknowledgements

The contents of this publication do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. The program SEGFOLD and its modified version SIGSTB are available via anonymous ftp as /home/ftp/pub/users/shuyun/sigfold at ftp.ncifcrf.gov. The script files for performing the statistical analyses in this study are available upon request from the

References (24)

  • S.-Y. Le et al. A method for assessing the statistical significance of RNA folding. J. Theor. Biol. (1989)
  • S.-Y. Le et al. Local thermodynamic stability scores are well represented by a non-central Student's t distribution. J. Theor. Biol. (2001)
  • S. Karlin et al. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. (1998)
  • R. Durbin et al. Biological Sequence Analysis (1998)
  • The RNA World
  • RNA Structure and Function
  • M.H. Malim et al. The HIV-1 rev trans-activator acts through a structured target sequence to activate nuclear export of unspliced viral mRNA. Nature (1989)
  • S.-Y. Le et al. A highly conserved RNA folding region coincident with the Rev response element of primate immunodeficiency viruses. Nucl. Acids Res. (1990)
  • S.-Y. Le et al. Stability of RNA stem-loop structure and distribution of non-random structure in the human immunodeficiency virus (HIV-1). Nucl. Acids Res. (1988)
  • S.-Y. Le et al. Thermodynamic stability and statistical significance of potential stem-loop structures situated at the frameshift sites of retro-viruses. Nucl. Acids Res. (1989)
  • S.-Y. Le et al. Identification of unusual RNA folding patterns encoded by bacteriophage T4 gene 60. Gene (1993)
  • S.-Y. Le et al. A common structural core in the internal ribosome entry sites of picornavirus, hepatitis C virus, and pestivirus. Virus Genes (1996)
  • Cited by (10)

    • Segmentation of DNA using simple recurrent neural network

      2012, Knowledge-Based Systems
      Citation Excerpt :

      Hidden Markov model were also used in extracting motifs for predicting the binding sites of unknown transcription factors, without a priori knowledge, from functionally related DNA sequences [3]. Machine learning methods are capable of building the models automatically and, then, the huge number of combinations of features can be tested [17,18]. For example, Sonnenburg et al. [4] use the kernel weight to determine the exon start.

    • Two-tiered approach identifies a network of cancer and liver disease-related genes regulated by miR-122

      2011, Journal of Biological Chemistry
      Citation Excerpt :

      Correlation of each of the predicted targets was evaluated using Pearson correlation analysis. The computational prediction of miRNA targets relied upon a set of computer programs, including Target, SigStb, SegFold, and Scanfd (26–28). We initially used Target to search for putative target regions containing complementary sequences with an miR-122 seed sequence (P2–P8) in which only one wobble base pair G:U or U:G was allowed in P3–P8.

    • Predicting protein secondary structure using a mixed-modal SVM method in a compound pyramid model

      2011, Knowledge-Based Systems
      Citation Excerpt :

      Meanwhile, increasing the accuracy of protein secondary structure prediction can play an important role in improving the accuracy of tertiary structure prediction, as demonstrated for ab initio and protein threading methods [2–4]. Many approaches have been successfully applied to the prediction of protein secondary structure, such as neural networks [5–7], hidden Markov models [8], support vector machines (SVM) [9,10], data mining [11–15] and so on. In this article, we firstly introduce a mixed-modal SVM method for predicting protein secondary structure, and then construct a novel compound pyramid model (CPM) to achieve higher prediction accuracy, using KDD∗ [16], Maradbcm [16], mixed-modal BP and mixed-modal SVM approaches.

    • Data mining of functional RNA structures in genomic sequences

      2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery