Elsevier

Biosystems

Volume 91, Issue 1, January 2008, Pages 183-194
Discriminating coding, non-coding and regulatory regions using rescaled range and detrended fluctuation analysis

https://doi.org/10.1016/j.biosystems.2007.05.019

Abstract

In this paper we analyse the efficiency of two methods, rescaled range analysis and detrended fluctuation analysis, in distinguishing between coding DNA, regulatory DNA and non-coding non-regulatory DNA of Drosophila melanogaster. Both methods were used to estimate the degree of sequential dependence (or persistence) among nucleotides.

We found that these three types of DNA can be discriminated by both methods, although rescaled range analysis performs slightly better than detrended fluctuation analysis. On average, non-coding, non-regulatory DNA has the highest degree of sequential persistence. Coding DNA could be characterised as being anti-persistent, which is in line with earlier findings of latent periodicity. Regulatory regions are shown to possess intermediate sequential dependency.

Together with other available methods, rescaled range and detrended fluctuation analysis on the basis of a combined purine/pyrimidine and weak/strong classification of the nucleotides are useful tools for refined structural and functional segmentation of DNA.

Introduction

It is increasingly acknowledged that variation in the complexity of organisms is due to the regulation of gene activity rather than to the genetic specifications for protein coding per se. Gene activity is dynamic and affected by, among other things, metabolism and cell signalling (such as communication via hormones) (Rees et al., 2000, Jump and Clarke, 1999, Yamada and Noguchi, 1999). According to Markstein et al. (2002) as much as 50% of the metazoan genome is regulatory. However, most of this is not yet deciphered as it is extremely difficult to identify the components of regulatory regions.

Regulatory regions contain transcription factor binding sites (TFBSs), short sequences of DNA which are often located upstream or downstream of the position where gene transcription begins (although they may also occur within a gene). In turn, these binding sites are “recognized” by transcription factors, proteins that, upon binding to them, act as repressors or activators, thus controlling the rate of transcription of DNA into mRNA and ultimately that of translation into proteins. The identification of regulatory regions and TFBSs is an essential, but unfortunately far from easy, step towards a deeper understanding of the regulation of individual genes. The difficulties are especially pronounced for higher eukaryotes, where some regulatory regions, called enhancers, are located far upstream or downstream of the target gene.

The desire for large-scale comprehension has driven the development of high-throughput methods. In turn, this has favoured computational approaches to the prediction of genomic components such as exons and regulatory regions, because these sidestep the ultimately more reliable but slow and expensive route of experimental verification. The work reported in this paper aims to contribute to the computational detection of regulatory DNA by contrasting its statistical characteristics with those of coding regions and of non-coding, non-regulatory regions.

Various methods have been used to characterize the statistical properties of genomic components. Nucleotide composition is commonly investigated with tools from information theory, i.e. by estimating the entropy of parts of the genome (Abnizova et al., 2006; Orlov and Potapov, 2004; Orlov et al., 2006), and from statistical linguistics, such as those based on Zipf's law (Mantegna et al., 1994). Statistical dependencies between nucleotides have been analysed using mutual information functions (Li, 1997), spectra (Voss, 1992; Vaidyanathan and Yoon, 2004; Bernola-Galván et al., 1999), hidden Markov models (Yoon and Vaidyanathan, 2006) and methods derived from random walk dynamics, such as detrended fluctuation analysis and rescaled range analysis (Ossadnik et al., 1994), to assess long-range correlations among nucleotides. The latter have attracted much attention (Abnizova et al., 2006; Voss, 1992; Herzel and Große, 1997; Azbel, 1995) and correlations up to 1000 bp have been found, in particular for non-coding DNA. Coding regions appear to lack such long-range correlations (Abnizova et al., 2006; Buldyrev et al., 1992; but see Voss, 1992) but seem instead to be characterised by three-cycle periodicities, as has been established by spectral analysis (Vaidyanathan and Yoon, 2004).
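As an illustration of how a three-cycle periodicity shows up in a spectrum, the following minimal sketch computes a plain discrete Fourier power spectrum of a purine indicator sequence and reports the period of the strongest component. The function names `power_spectrum` and `dominant_period`, the 0/1 purine indicator and the toy sequence are assumptions made for this example; they are not taken from the cited studies.

```python
import cmath
import math


def power_spectrum(values):
    """Power at frequencies k/n for k = 0 .. n//2, after removing the mean."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def dominant_period(values):
    """Period (in positions) of the strongest non-zero frequency component."""
    spectrum = power_spectrum(values)
    peak = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return len(values) / peak


# Toy sequence with a perfect codon-like repeat (an assumption for illustration).
toy = "ATG" * 30
indicator = [1 if base in "AG" else 0 for base in toy]
print(dominant_period(indicator))
```

Real coding sequences are far noisier than this toy repeat, so in practice the component at frequency 1/3 would be compared with the background level of the spectrum rather than read off as a single clean peak.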

A conventional way to study DNA is by functional segmentation, a top-down approach in which a genomic sequence is partitioned into segments and these are identified as a particular functional type of DNA (such as coding or regulatory regions) if their statistical properties match those of experimentally verified cases of that functional type. The opposite strategy could be called serial prospecting. This is a bottom-up procedure that maps the DNA landscape by assessing local statistical properties while moving along the sequence. By using sliding windows, changes in these local statistical properties can be detected. Regions with striking statistical features or change points therein are candidates for further analysis.
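The sliding-window idea can be sketched as follows, using Shannon entropy of the nucleotide composition as the local statistic. This is only an illustration under stated assumptions: the function names, the fixed window and step sizes and the toy sequence are hypothetical, and the adaptive window sizing used in the authors' earlier work is not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per symbol) of the symbol composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(sequence, window, step=1):
    """Return (start position, entropy) for each window along the sequence."""
    return [(start, shannon_entropy(sequence[start:start + window]))
            for start in range(0, len(sequence) - window + 1, step)]


# Toy sequence: a homogeneous stretch followed by a mixed stretch.
for start, entropy in sliding_entropy("AAAAAAAAACGTACGT", window=8, step=4):
    print(start, round(entropy, 3))
```

For a four-letter alphabet the entropy ranges from 0 bits (a single repeated nucleotide) to 2 bits (all four nucleotides equally frequent), so a jump in the window entropy marks a change in local composition.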

In both approaches the choice of segment size and moving-window size is subjective. The problem is that segments and windows that are too small may not encompass the complete region of interest, or may reveal too much detail. However, when segments and windows are too large they may overlook critically important local differences or, if the sequence is non-stationary, contain regions of different statistical structure. Non-stationarity violates assumptions of most algorithms currently in use and makes the results of Markov models and information theoretic measures worse than meaningless.

In our previous work (Abnizova et al., 2006, Orlov et al., 2006), we advocated a combined approach that also addresses the problem of non-stationarity. This procedure involves:

  • (i) Using statistical descriptors that distinguish between different types of DNA on the basis of the compositional heterogeneity and non-stationarity of a nucleotide sequence. We have used informational entropy for measuring compositional heterogeneity and rescaled range analysis to estimate the Hurst exponent (H), which we use as a measure for the degree of non-stationarity; a value of H < 0.5 points to short-term correlations, H = 0.5 represents a series of independent and identically distributed measurements (as in white noise) and H > 0.5 indicates the presence of long-range correlations (Schroeder, 1991).

  • (ii) Adapting the window size so that compositional homogeneity (maximal entropy) and stationarity (a minimal value of |H − 0.5|) are locally optimised (for further explanation, see Section 2.2.1).
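To make the role of the Hurst exponent concrete, here is a minimal sketch of a textbook rescaled range estimate: the series is cut into non-overlapping windows, the range of the cumulative deviations from the window mean is divided by the window's standard deviation, and H is taken as the slope of log(R/S) against log(window size). The function names, the doubling window sizes and the `min_window` parameter are assumptions for this example and are not necessarily the settings used by the authors.

```python
import math
import random


def rescaled_range(window):
    """Return R/S for one window, or None if the window has zero variance."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return None
    running = lowest = highest = 0.0
    for x in window:
        running += x - mean          # cumulative deviation from the mean
        lowest = min(lowest, running)
        highest = max(highest, running)
    return (highest - lowest) / std


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def hurst_rs(values, min_window=8):
    """Estimate H as the slope of log(mean R/S) versus log(window size)."""
    n = len(values)
    log_sizes, log_rs = [], []
    size = min_window
    while size <= n // 2:
        ratios = []
        for start in range(0, n - size + 1, size):
            rs = rescaled_range(values[start:start + size])
            if rs is not None:
                ratios.append(rs)
        if ratios:
            log_sizes.append(math.log(size))
            log_rs.append(math.log(sum(ratios) / len(ratios)))
        size *= 2                    # window sizes 8, 16, 32, ...
    if len(log_sizes) < 2:
        raise ValueError("sequence too short for the chosen window sizes")
    return least_squares_slope(log_sizes, log_rs)


rng = random.Random(0)
noise = [rng.choice((-1, 1)) for _ in range(4096)]
print("independent steps:", round(hurst_rs(noise), 2))
print("strictly alternating steps:", round(hurst_rs([1, -1] * 2048), 2))
```

A strictly alternating series is the extreme anti-persistent case and gives an estimate near 0, whereas independent steps give a value in the neighbourhood of 0.5. The simple R/S estimator is known to be biased for short windows, so estimates for independent data tend to lie somewhat above 0.5.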

We found that this procedure detected start positions of exons quite well (see Table A.1 in Appendix A), implying that these units can be typified as being relatively homogeneous and stationary. The results were more pronounced if windows adapted their size to contain local minima of the Hurst exponent than if they were optimized around local maxima of entropy. Furthermore, the Hurst exponent appeared to be remarkably good at characterizing sequences of experimentally verified exons, which were found to have a significantly lower average H than non-coding regions (Abnizova et al., 2006, Orlov et al., 2006).

However, questions remain. First of all, why does the Hurst exponent perform so well? Hurst exponents are estimated by means of rescaled range analysis, a procedure that is not known to be particularly powerful (Clegg, 2005). What would the results have been if we had used detrended fluctuation analysis instead, a method of estimating sequential persistence that is currently favoured by many (for a review see also Buldyrev et al., 1995; the website http://reylab.bidmc.harvard.edu/tutorial/DFA/node5.html and Bernola-Galván et al., 1999)?
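For comparison with rescaled range analysis, a minimal sketch of detrended fluctuation analysis is given below. It assumes first-order detrending (a straight line fitted in each non-overlapping window) and doubling window sizes; these choices, the function names and the `min_window` parameter are assumptions for illustration rather than the settings of any cited study. The scaling exponent α is the slope of log F(n) against log n, where F(n) is the root-mean-square residual of the integrated series after detrending in windows of size n.

```python
import math
import random


def linear_fit(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / den
    return slope, mean_y - slope * mean_x


def dfa_exponent(values, min_window=8):
    """Estimate the DFA scaling exponent alpha with linear detrending."""
    n = len(values)
    mean = sum(values) / n
    profile, total = [], 0.0
    for v in values:                 # integrate the mean-centred series
        total += v - mean
        profile.append(total)
    log_sizes, log_fluct = [], []
    size = min_window
    while size <= n // 4:
        xs = list(range(size))
        squared, count = 0.0, 0
        for start in range(0, n - size + 1, size):
            segment = profile[start:start + size]
            slope, intercept = linear_fit(xs, segment)
            squared += sum((y - (slope * x + intercept)) ** 2
                           for x, y in zip(xs, segment))
            count += size
        fluctuation = math.sqrt(squared / count)
        if fluctuation > 0:
            log_sizes.append(math.log(size))
            log_fluct.append(math.log(fluctuation))
        size *= 2
    if len(log_sizes) < 2:
        raise ValueError("sequence too short for the chosen window sizes")
    return linear_fit(log_sizes, log_fluct)[0]


rng = random.Random(1)
noise = [rng.choice((-1, 1)) for _ in range(4096)]
walk, position = [], 0
for step in noise:                   # a random walk is integrated noise
    position += step
    walk.append(position)
print("independent steps:", round(dfa_exponent(noise), 2))
print("random walk:", round(dfa_exponent(walk), 2))
```

Independent steps are expected to give α close to 0.5, while the random walk built from them should give a value close to 1.5, which is a convenient sanity check on any implementation.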

Another issue concerns the way in which rescaled range analysis and detrended fluctuation analysis work. Both methods require that, prior to further analysis, the investigated sequence is binary coded, and for this a pyrimidine/purine (P/P) classification is often chosen (but see Buldyrev et al. (1995) for an application of all possible coding schemes). However, there is no reason not to use other dichotomies, such as a weak/strong bonding (W/S) categorisation. It is important to study the effect of alternative classification conventions to validate the established view that coding regions have generally low sequential persistence due to a lack of long-range correlations (Buldyrev et al., 1995). Indeed, most studies report a Hurst exponent of coding DNA of around 0.5, which corresponds to a series of independently positioned purine or pyrimidine nucleotides. But how does this relate to the three-cycle periodicities found by spectral analysis of coding regions, which imply short-range correlations (and hence a value of H < 0.5)?
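The two binary coding conventions can be sketched as below. Purines are A and G and pyrimidines are C and T; weakly bonding bases (two hydrogen bonds) are A and T and strongly bonding bases (three hydrogen bonds) are C and G. The ±1 sign convention, the scheme labels "PP" and "WS" and the function names are arbitrary choices made for this example.

```python
PURINES = frozenset("AG")   # pyrimidines are C and T
STRONG = frozenset("CG")    # weak bases are A and T


def binary_code(sequence, scheme="PP"):
    """Map a DNA string to a list of +1/-1 steps.

    "PP": purine -> +1, pyrimidine -> -1
    "WS": strong -> +1, weak -> -1
    """
    if scheme == "PP":
        positive = PURINES
    elif scheme == "WS":
        positive = STRONG
    else:
        raise ValueError("scheme must be 'PP' or 'WS'")
    steps = []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError("unexpected symbol: " + repr(base))
        steps.append(1 if base in positive else -1)
    return steps


def dna_walk(steps):
    """Cumulative sum of the steps, i.e. the 'DNA walk' profile."""
    walk, position = [], 0
    for step in steps:
        position += step
        walk.append(position)
    return walk


print(binary_code("ACGT", "PP"))
print(binary_code("ACGT", "WS"))
print(dna_walk(binary_code("ACGT", "PP")))
```

The same nucleotide sequence yields different step series under the two conventions, which is why the estimated persistence can depend on the chosen classification.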

We will address these questions by comparing how well rescaled range analysis and detrended fluctuation analysis, based on both a P/P and a W/S binary classification convention, discriminate between coding, regulatory and non-coding, non-regulatory sequences.

The analysis presented in this paper differs from other similar comparative studies in that we explicitly focus on exons and regulatory regions rather than just coding and non-coding regions. Furthermore, the exons and regulatory regions are exclusively experimentally verified sequences and are analysed at the level of individual sequences of a single species (i.e. estimates are not pooled or averaged over species) using rigorous statistics (balanced repeated measurement ANOVA and non-parametric procedures).

Section snippets

Data

The size of the data set is constrained by our requirement to work exclusively with experimentally verified coding- and regulatory enhancer (and not promoter) regions. The sample sizes should be as large as possible but balanced, i.e. the regulatory regions should not be outnumbered by the in principle very large numbers of coding and non-coding, non-regulatory sequences that could be obtained. By thus opting for a data set of high quality rather than quantity, the limiting factor was the

ANOVA analysis of DNA type discrimination

The degree of persistence was estimated for three types of sequences (CODING = coding DNA; REGREG = regulatory regions; NCNREG = non-coding, non-regulatory regions), by two methods (RRA, DFA) using two classification conventions (P/P, W/S). Correspondingly, we have three possible factors affecting the degree of persistence: DNATYPE, METHOD and BINCODE.

We intended to estimate the effects of the three factors on the degree of persistence by means of a multi-variate ANOVA of a between-groups, repeated

Similarities and discrepancies with other studies

In accordance with the observations of Peng et al. (1994) and Buldyrev et al. (1993, 1995), we found a significant difference between different functional parts of DNA, with sequential persistence for coding regions being lower than for non-coding DNA. However, there are also differences, the most striking one being our low values for exons (which actually suggest an anti-persistent sequential structure). Whereas the average Hurst exponent and α for exons and non-conserved,

References (43)

  • Boeva, V., Makeev, V., Régnier, M., 2004. SWAN: searching for highly divergent tandem repeats in DNA sequences and...
  • Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Simons, M., Sciortino, F., Stanley, H.E., 1992. Long range...
  • Buldyrev, S.V., et al., 1993. Long range fractal correlations in DNA. Phys. Rev. Lett.
  • Buldyrev, S.V., et al. Fractals in biology and medicine: from DNA to the heartbeat.
  • Buldyrev, S.V., et al., 1995. Long-range correlational properties of coding and noncoding DNA sequences: GenBank analysis in DNA. Phys. Rev.
  • Chechetkin, V., et al., 1988. Study of correlations in segmented DNA sequences: application to structure coupling between exons and introns. J. Theor. Biol.
  • Clegg, R., 2005. A Practical Guide to Measuring the Hurst Parameter,...
  • Damon, R., et al., 1987. Experimental Design ANOVA and Regression.
  • Herzel, H., et al., 1997. Correlations in DNA sequences: the role of protein coding segments. Phys. Rev. E.
  • Gneiting, T., Schlather, M., 2003. Stochastic Tools That Separate Fractal Dimension and Hurst Effect. Technical Report...
  • Jump, D.B., et al., 1999. Regulation of gene expression by dietary fat. Annu. Rev. Nutr.