Abstract
Over the last few decades, much effort has been devoted to developing approaches for identifying good predictions of RNA secondary structure. The reason is that most computational prediction methods based on free energy minimization compute a number of suboptimal foldings, and the native folding has to be identified among all these possible secondary structures. In the abstract shapes approach introduced by Giegerich et al. (Nucleic Acids Res 32(16):4843–4851, 2004), each class of similar secondary structures is represented by one shape, and the native structures can be found among the top shape representatives. In this article, we derive results answering enumeration problems for abstract shapes and secondary structures of RNA. From a combinatorial point of view, we compute precise asymptotics for the number of different shape representations of size n and for the number of different shapes showing up when abstracting from secondary structures of size n. A more realistic model taking primary structures into account remains an open challenge. We give some arguments why the present techniques cannot be applied in this case.
Notes
Later, we will speak of the minimal length of hairpin loops being 1, but so far we do not have the right vocabulary.
Allowing pseudoknots makes secondary structure prediction \({{\mathcal{NP}}}\)-complete, which is probably the reason for their exclusion.
It would be an easy task to change the definition to allow loops of length at least 3 only. However, when turning to enumeration and the corresponding methods from singularity analysis, such a change would imply polynomials of higher degree and the need to compute their roots. Thus, to keep the mathematics behind the model manageable, this modification was probably forgone. Nevertheless, for covariance models, where these reasons do not apply, loops of length 0 are sometimes allowed in the consensus.
In accordance with observations independently made by R. Giegerich at about the same time (personal communication).
According to the informal description of level 1 shapes given in Janssen et al. (2008), it is not clear whether the (one and only, but always existing) unpaired region in a hairpin must be recorded on this shape abstraction level or not. Here, we decided to follow the definition used by the RNAShapes tool, which is available at http://bibiserv.techfak.uni-bielefeld.de/rnashapes/welcome.html. This tool assumes that hairpin loops are not recorded.
Note that it does not matter whether a hairpin is represented only by a pair of corresponding square brackets or by a pair of corresponding square brackets with an underscore in between, as there must always exist an unpaired region of length at least one in any hairpin.
Unambiguity is necessary, as we will later use these grammars to construct generating functions counting the numbers of type i shapes, 1 ≤ i ≤ 5. If there is more than one leftmost derivation for a type i shape sh, 1 ≤ i ≤ 5, then sh is counted more than once by the corresponding generating function.
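The overcounting caused by ambiguity can be seen in a small sketch. The grammar S → S S | [ S ] | [ ] used here is a hypothetical toy example, not one of the shape grammars of this article. The function `derivations` follows the recurrence that a generating function built from the grammar rules would satisfy, while `words` enumerates the distinct bracket words actually generated.

```python
from functools import lru_cache

# Hypothetical ambiguous toy grammar (an assumption for illustration only):
#   S -> S S | [ S ] | [ ]

@lru_cache(maxsize=None)
def derivations(n):
    """Number of parse trees yielding words with n bracket pairs."""
    if n == 0:
        return 0
    if n == 1:
        return 1  # only S -> [ ]
    # Rule S -> S S: split the n pairs between the two subtrees.
    split = sum(derivations(k) * derivations(n - k) for k in range(1, n))
    # Rule S -> [ S ]: one pair is consumed by the outer brackets.
    return split + derivations(n - 1)

@lru_cache(maxsize=None)
def words(n):
    """Distinct words with n bracket pairs generated by the same grammar."""
    if n == 1:
        return frozenset({"[]"})
    result = {"[" + w + "]" for w in words(n - 1)}
    for k in range(1, n):
        result |= {u + v for u in words(k) for v in words(n - k)}
    return frozenset(result)

for n in range(1, 6):
    print(n, derivations(n), len(words(n)))
```

From n = 3 on, the two columns differ: the word `[][][]` has two parse trees, so a generating function derived from this grammar would count it twice.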
Note that in this article, we will not recall the fundamental definitions and methods regarding generating functions. An introduction to generating functions and some of their uses in discrete mathematics can be found for example in Flajolet and Sedgewick (2009) and Wilf (1994). Several good examples for generating functions can be found in Comtet (1974). Furthermore, for an introduction to some advanced methods that have to be used for more difficult problems, see for example Greene and Knuth (1990).
In this paper, we use \([z^n]S(z)\) to denote the coefficient at \(z^n\) in the expansion of \(S(z)\) around \(z = 0\).
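As a small illustration of this notation, the following sketch computes \([z^n]S(z)\) for one concrete choice of \(S(z)\). The choice is an assumption made for illustration: \(S(z)\) is taken to be the generating function of secondary structures of size n with minimal hairpin loop length 1 in the purely combinatorial model of Waterman (1978), which satisfies \(S(z) = 1 + zS(z) + z^2 S(z)(S(z) - 1)\). The function name `structure_counts` is hypothetical.

```python
def structure_counts(n_max):
    """Return [z^0]S(z), ..., [z^n_max]S(z) for the assumed equation
    S(z) = 1 + z*S(z) + z^2 * S(z) * (S(z) - 1)."""
    s = [1]  # [z^0]S(z): the empty structure
    for n in range(1, n_max + 1):
        # Last position paired: a positions to the left of the pair and
        # n - 2 - a >= 1 positions enclosed by it (hairpin loop length >= 1).
        paired = sum(s[a] * s[n - 2 - a] for a in range(0, n - 2))
        # Last position unpaired: any structure on the first n - 1 positions.
        s.append(s[n - 1] + paired)
    return s

print(structure_counts(7))  # [1, 1, 1, 2, 4, 8, 17, 37]
```

Comparing coefficients on both sides of the functional equation gives exactly the recurrence used in the loop.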
In the considered version of Darboux’s theorem as given in Knuth and Wilf (1989), the variable m is used to choose the number of terms for the computed asymptotic. In fact, by choosing m = 0, the resulting asymptotic consists of the leading term only.
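What "leading term only" means in practice can be checked numerically. The sketch below uses the Catalan numbers as a stand-in, not a generating function from this article: their generating function \((1 - \sqrt{1 - 4z})/(2z)\) has a square-root singularity at \(z = 1/4\), and the leading term of the asymptotic is \(4^n / (\sqrt{\pi}\, n^{3/2})\). The function names are hypothetical.

```python
from math import comb, pi, sqrt

def catalan(n):
    """Exact n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def leading_term_ratio(n):
    """Exact Catalan number divided by the leading asymptotic term
    4^n / (sqrt(pi) * n^(3/2))."""
    # Dividing the big integers first avoids float overflow for large n.
    scaled = comb(2 * n, n) / ((n + 1) * 4 ** n)
    return scaled * sqrt(pi) * n ** 1.5

for n in (10, 100, 1000):
    print(n, leading_term_ratio(n))
```

The ratio approaches 1 from below as n grows. The remaining gap is what additional terms of the expansion (m > 0) would account for.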
Within our grammar, the rule \(B\rightarrow\varepsilon\) generates from a sentential form ...[B]... such a pair of brackets and therefore has to be weighted by a factor z.
References
Abrahams JP, van den Berg M, van Batenburg E, Pleij CW (1990) Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res 18(10):3035–3044
Comtet L (1974) Advanced combinatorics; the art of finite and infinite expansions. Reidel, Dordrecht
Chomsky N, Schützenberger MP (1963) The algebraic theory of context-free languages. In: Braffort P, Hirschberg D (eds) Computer programming and formal systems. North-Holland, Amsterdam, pp 118–161
Ding Y, Chan C, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32:W135–W141
Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24):7280–7301
Dam E, Pleij K, Draper D (1992) Structural and functional aspects of RNA pseudoknots. Biochemistry 31:11665–11676
Flajolet P, Sedgewick R (2009) Analytic combinatorics. Cambridge University Press, London
Greene DH, Knuth DE (1990) Mathematics for the analysis of algorithms, 3rd edn. Birkhäuser, Boston
Giegerich R, Voß B, Rehmsmeier M (2004) Abstract shapes of RNA. Nucleic Acids Res 32(16):4843–4851
Gutell RR, Woese CR (1990) Higher order structural elements in ribosomal RNAs: pseudo-knots and the use of noncanonical pairs. Proc Natl Acad Sci USA 87:663–667
Harrison MA (1978) Introduction to formal language theory. Addison-Wesley, Reading
Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation, 2nd edn. Addison-Wesley, Reading
Janssen S, Reeder J, Giegerich R (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinformatics 9(1):131
Knuth DE, Wilf HS (1989) A short proof of Darboux’s lemma. Appl Math Lett 2:139–140
Lorenz WA, Ponty Y, Clote P (2008) Asymptotics of RNA shapes. J Comput Biol 15(1):31–63
Nebel ME (2004) Investigation of the Bernoulli-model of RNA secondary structures. Bull Math Biol 66:925–964
Nussinov R, Jacobson AB (1980) Fast algorithms for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA 77(11):6309–6313
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ (1978) Algorithms for loop matchings. SIAM J Appl Math 35:68–82
Pleij CW, Bosch L (1989) RNA pseudoknots: structure, detection, and prediction. Methods Enzymol 180:289–303
Pleij CW (1994) RNA pseudoknots. Curr Opin Struct Biol 4:337–344
Reeder J, Giegerich R (2005) Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics 21(17):3516–3523
Sankoff D, Kruskal JB, Mainville S, Cedergren RJ (1983) Fast algorithms to determine RNA secondary structures containing multiple loops. In: Time warps, string edits, and macromolecules: the theory and practice of sequence comparison, chap 3. Addison-Wesley, Reading, pp 93–120
Scheid A, Nebel ME (2008) On abstract shapes of RNA. Technical report, Technische Universität Kaiserslautern
Steffen P, Voß B, Rehmsmeier M, Reeder J, Giegerich R (2006a) RNAshapes 2.1.1 manual
Steffen P, Voß B, Rehmsmeier M, Reeder J, Giegerich R (2006b) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 22(4):500–503
Viennot G, Vauchaussade de Chaumont M (1985) Enumeration of RNA secondary structures by complexity. Math Med Biol Lect Notes Biomath 57:360–365
Voß B, Giegerich R, Rehmsmeier M (2006) Complete probabilistic analysis of RNA shapes. BMC Biol 4(5)
Waterman MS (1978) Secondary structure of single-stranded nucleic acids. Adv Math Suppl Stud 1:167–212
Wuchty S, Fontana W, Hofacker I, Schuster P (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49:145–165
Wilf HS (1994) Generatingfunctionology, 2nd edn. Academic Press, London
Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148
Zuker M, Sankoff D (1984) RNA secondary structures and their prediction. Bull Math Biol 46:591–621
Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244:48–52
Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
Acknowledgments
The authors wish to thank the two anonymous reviewers for their careful and helpful remarks and suggestions made for a previous version of this article.
Appendix
During our investigations, we computed precise asymptotics for the size of the folding space F(s) for different models of secondary structures. The models differ with respect to structural restrictions (minimal length of hairpin loops, isolated base pairs) and the complementarity assumed (Watson–Crick pairings only, wobble GU pairs allowed), assuming either a uniform distribution for the bases or a skewed one (\(p_A = p_U = 2/10\), \(p_C = p_G = 3/10\)), according to the experiments performed in Giegerich et al. (2004) and Voß et al. (2006).
Although these results were of no use to our investigations related to abstract shapes, owing to the problems reported when analyzing shape spaces, we expect them to be of use in the future and therefore decided to present them in this appendix without proof. A few of those results may already be found in the literature (see e.g., Nebel 2004), but such a complete presentation does not exist.
Theorem 6.1
Consider either a uniform distribution of the bases A, C, G and U or the skewed distribution \(p_A = p_U = 2/10\), \(p_C = p_G = 3/10\), either Watson–Crick pairings only or additionally wobble GU pairs, and each possible combination of a minimum hairpin loop length minLhairpin ∈ {1, 3} and a minimum helix length minLladder ∈ {1, 2}. Then the asymptotic expected folding space sizes card(F(s)) for a random primary structure s of size \(n, n \rightarrow \infty,\) are those given in Table 3, shown in roman for the uniform distribution and in italics for the skewed one.
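For a single concrete sequence, the quantity card(F(s)) can be computed by dynamic programming. The following is a minimal sketch, assuming that F(s) is the set of all pseudoknot-free secondary structures compatible with s and that isolated base pairs are allowed (minimum helix length 1). The function name `folding_space_size` and its parameters are hypothetical and do not come from the article.

```python
from functools import lru_cache

WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
WOBBLE = {("G", "U"), ("U", "G")}

def folding_space_size(seq, min_hairpin=1, wobble=False):
    """Count the secondary structures compatible with seq (assumed model:
    no pseudoknots, isolated base pairs allowed)."""
    allowed = WATSON_CRICK | WOBBLE if wobble else WATSON_CRICK

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of structures on the half-open segment seq[i:j].
        if j - i <= min_hairpin + 1:
            return 1  # too short for any base pair
        last = j - 1
        total = count(i, last)  # last base unpaired
        # Last base paired with k, leaving at least min_hairpin bases inside.
        for k in range(i, last - min_hairpin):
            if (seq[k], seq[last]) in allowed:
                total += count(i, k) * count(k + 1, last)
        return total

    return count(0, len(seq))

print(folding_space_size("GGACC"))                   # 6
print(folding_space_size("GGACC", min_hairpin=3))    # 2
```

Averaging this count over random sequences drawn from one of the base distributions gives an empirical estimate of the expected folding space size for small n. The recursion depth grows with the sequence length, so long sequences would need an iterative table instead.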
Nebel, M.E., Scheid, A. On quantitative effects of RNA shape abstraction. Theory Biosci. 128, 211–225 (2009). https://doi.org/10.1007/s12064-009-0074-z