Discriminating coding, non-coding and regulatory regions using rescaled range and detrended fluctuation analysis
Introduction
It is increasingly acknowledged that variation in the complexity of organisms is due to the regulation of gene activity rather than to the genetic specifications for protein coding per se. Gene activity is dynamic and affected by, among other things, metabolism and cell signalling (such as communication via hormones) (Rees et al., 2000, Jump and Clarke, 1999, Yamada and Noguchi, 1999). According to Markstein et al. (2002) as much as 50% of the metazoan genome is regulatory. However, most of this is not yet deciphered as it is extremely difficult to identify the components of regulatory regions.
Regulatory regions contain transcription factor binding sites (TFBS), short sequences of DNA which are often located upstream or downstream of the position where gene transcription begins (although they may also occur within a gene). In turn, these binding sites are “recognized” by transcription factors, proteins that, upon binding to them, act as repressors or activators, thus controlling the rate of transcription of DNA into mRNA and ultimately that of translation into proteins. The identification of regulatory regions and TFBSs is an obviously essential, but unfortunately far from easy, step towards a deeper understanding of the regulation of individual genes. The difficulties are especially pronounced for higher eukaryotes, where some regulatory regions, called enhancers, are located far upstream or downstream of the target gene.
The desire for large-scale comprehension has driven the development of high-throughput methods. In turn, this has favoured computational approaches to the prediction of genomic components such as exons and regulatory regions because these sidestep the ultimately more reliable but slow and expensive route of experimental verification. The work reported in this paper aims to contribute to the computational detection of regulatory DNA by contrasting its statistical characteristics with those of (non-)coding but non-regulatory regions.
Various methods have been used to characterize the statistical properties of genomic components. Nucleotide composition is commonly investigated with tools from information theory, i.e. by estimating the entropy of parts of the genome (Abnizova et al., 2006; Orlov and Potapov, 2004, Orlov et al., 2006) and statistical linguistics, such as those based on Zipf's law (Mantegna et al., 1994). Statistical dependencies between nucleotides have been analysed using mutual information functions (Li, 1997), spectra (Voss, 1992, Vaidyanathan and Yoon, 2004, Bernola-Galván et al., 1999), hidden Markov models (Yoon and Vaidyanathan, 2006) and methods derived from random walk dynamics, such as detrended fluctuation- and rescaled range analysis (Ossadnik et al., 1994), to assess long-range correlations among nucleotides. The latter have attracted much attention (Abnizova et al., 2006, Voss, 1992, Herzel and Große, 1997, Azbel, 1995) and correlations up to 1000 bp have been found in particular for non-coding DNA. Coding regions appear to lack such long-range correlations (Abnizova et al., 2006, Buldyrev et al., 1992; but see Voss, 1992) but seem instead to be characterised by three cycle periodicities, as has been established by spectral analysis (Vaidyanathan and Yoon, 2004).
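The three-base periodicity attributed to coding regions can be illustrated with a small spectral sketch. The code below is an assumption-laden illustration, not the procedure of the cited studies: it builds one indicator sequence per nucleotide, evaluates the discrete Fourier transform at frequency 1/3 only, and sums the squared magnitudes. The function name `period3_power` and the toy sequences are hypothetical.

```python
import cmath

def period3_power(sequence):
    """Spectral power at frequency 1/3, summed over the four base indicator sequences.

    For each base, the indicator sequence is 1 where that base occurs and 0 elsewhere.
    The result is divided by the sequence length so that sequences of different
    lengths are roughly comparable (a normalisation chosen for this sketch).
    """
    n = len(sequence)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence at frequency 1/3.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, b in enumerate(sequence)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n

# Toy inputs (made up): a perfect period-3 repeat and a period-4 repeat.
periodic3 = "ATG" * 20
periodic4 = "ACGT" * 15

strong = period3_power(periodic3)
weak = period3_power(periodic4)
```

A pronounced value relative to a shuffled or non-coding control would point to a three-base periodicity; real exons give a much weaker signal than the idealised repeat used here.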
A conventional way to study DNA is by functional segmentation, a top-down approach in which a genomic sequence is partitioned into segments and these are identified as a particular functional type of DNA (such as coding or regulatory regions) if their statistical properties match those of experimentally verified cases of that functional type. The opposite strategy could be called serial prospecting. This is a bottom-up procedure that maps the DNA landscape by assessing local statistical properties while moving along the sequence. By using sliding windows, changes in these local statistical properties can be detected. Regions with striking statistical features or change points therein are candidates for further analysis.
In both approaches the choice of segment and moving-window size is subjective. Segments and windows that are too small may fail to encompass the complete region of interest, or may reveal too much detail. Segments and windows that are too large may overlook critically important local differences or, if the sequence is non-stationary, contain regions of different statistical structure. Non-stationarity violates the assumptions of most algorithms currently in use and makes the results of Markov models and information-theoretic measures worse than meaningless.
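As a minimal sketch of serial prospecting, the code below moves a fixed-size window along a sequence and records the Shannon entropy of the nucleotide frequencies in each window. The window size, step size, function names and toy sequence are assumptions made for illustration; the entropy measure is only one of the local statistics that could be tracked this way.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (bits per nucleotide) of the symbol frequencies in a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def sliding_entropy(sequence, window=8, step=4):
    """Return (start, entropy) pairs for fixed-size windows moved along the sequence."""
    return [
        (start, shannon_entropy(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]

# Toy sequence (made up): a low-complexity run followed by a compositionally mixed region.
toy = "AAAAAAAAAAAAACGTACGTACGT"
profile = sliding_entropy(toy)

for start, h in profile:
    print(f"window at {start:2d}: entropy = {h:.3f} bits")
```

The jump in entropy between the first and last windows is the kind of change point that would be flagged for further analysis.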
In our previous work (Abnizova et al., 2006, Orlov et al., 2006), we advocated a combined approach that also addresses the problem of non-stationarity. This procedure involves:
- (i) Using statistical descriptors that distinguish between different types of DNA on the basis of the compositional heterogeneity and non-stationarity of a nucleotide sequence. We have used informational entropy for measuring compositional heterogeneity and rescaled range analysis to estimate the Hurst exponent (H), which we use as a measure for the degree of non-stationarity; a value of H < 0.5 points to short-term correlations, H = 0.5 represents a series of independent and identically distributed measurements (as in white noise) and H > 0.5 indicates the presence of long-range correlations (Schroeder, 1991).
- (ii) Adapting the window size so that compositional homogeneity (maximal entropy) and stationarity (a minimal value of |H − 0.5|) are locally optimised (for further explanation, see Section 2.2.1).
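The Hurst exponent mentioned in step (i) can be estimated along the following lines. This is a generic textbook sketch of rescaled range analysis, not necessarily the implementation used in the cited work: the block sizes, the use of non-overlapping blocks and the function names are assumptions. The input is assumed to be a numeric series, for example a nucleotide sequence coded as +1/−1.

```python
import math
import random

def rescaled_range(values):
    """R/S statistic of one block: range of the cumulative deviations from the
    block mean, divided by the block's standard deviation."""
    n = len(values)
    mean = sum(values) / n
    cumulative, running = [], 0.0
    for v in values:
        running += v - mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return spread / std if std > 0 else None  # constant blocks carry no information

def hurst_rs(series, block_sizes=(16, 32, 64, 128, 256)):
    """Estimate H as the least-squares slope of log(mean R/S) against log(block size)."""
    xs, ys = [], []
    for size in block_sizes:
        blocks = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
        stats = [rs for rs in map(rescaled_range, blocks) if rs is not None]
        if stats:
            xs.append(math.log(size))
            ys.append(math.log(sum(stats) / len(stats)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Synthetic inputs (made up): independent +1/-1 steps, and their running sum,
# which is a strongly persistent series.
rng = random.Random(0)
noise = [rng.choice((-1, 1)) for _ in range(4096)]
walk, total = [], 0
for step in noise:
    total += step
    walk.append(total)

h_noise = hurst_rs(noise)
h_walk = hurst_rs(walk)
deviation = abs(h_noise - 0.5)  # the quantity minimised in step (ii)
```

For short blocks the R/S estimate of independent data tends to lie somewhat above 0.5, so the absolute value of H from such a sketch should be read with caution.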
We found that this procedure detected start positions of exons quite well (see Table A.1 in Appendix A), implying that these units can be typified as being relatively homogeneous and stationary. The results were more pronounced if windows adapted their size to contain local minima of the Hurst exponent rather than to optimise around local maxima of entropy. Furthermore, the Hurst exponent appeared to be remarkably good at characterising sequences of experimentally verified exons, which were found to have a significantly lower average H than non-coding regions (Abnizova et al., 2006, Orlov et al., 2006).
However, questions remain. First of all, why does the Hurst exponent perform so well? Hurst exponents are estimated by means of rescaled range analysis, a procedure that is not known for being particularly powerful (Clegg, 2005). What would the results have been if we had used detrended fluctuation analysis instead, a method of estimating sequential persistence that is currently favoured by many (for a review also see Buldyrev et al., 1995; the website http://reylab.bidmc.harvard.edu/tutorial/DFA/node5.html and Bernola-Galván et al., 1999)?
Another issue concerns the way in which rescaled range analysis and detrended fluctuation analysis work. Both methods require that prior to further analysis the investigated sequence is binary coded and for this a pyrimidine/purine (P/P) classification is often chosen (but see Buldyrev et al. (1995) for an application of all possible coding schemes). However, there is no reason not to use other dichotomies, such as a weak/strong bonding (W/S) categorisation. It is important to study the effect of alternative classification conventions to validate the established view that coding regions have generally low sequential persistence due to a lack of long-range correlations (Buldyrev et al., 1995). Indeed, most studies report a Hurst exponent of coding DNA of around 0.5, which corresponds to a series of independently positioned purine or pyrimidine nucleotides. But how does this relate to the three-cycle periodicities found by spectral analysis of coding regions, which imply short-range correlations (and hence a value of H < 0.5)?
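For comparison, detrended fluctuation analysis can be sketched as follows. This is a generic first-order version (a straight-line trend removed from each box) written for illustration; the box sizes, the use of non-overlapping boxes and the function names are assumptions rather than details taken from the cited sources.

```python
import math
import random

def residual_mean_square(ys):
    """Mean squared residual of ys around its least-squares straight line."""
    n = len(ys)
    mx = (n - 1) / 2
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    slope = sxy / sxx
    return sum((y - my - slope * (x - mx)) ** 2 for x, y in enumerate(ys)) / n

def dfa_alpha(series, box_sizes=(8, 16, 32, 64, 128)):
    """Estimate the scaling exponent alpha as the slope of log F(n) against log n."""
    # Step 1: integrate the mean-centred series to obtain the "profile".
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for v in series:
        running += v - mean
        profile.append(running)
    # Step 2: for each box size, detrend every box and average the squared residuals.
    xs, ys = [], []
    for size in box_sizes:
        boxes = [profile[i:i + size] for i in range(0, len(profile) - size + 1, size)]
        f2 = sum(residual_mean_square(b) for b in boxes) / len(boxes)
        xs.append(math.log(size))
        ys.append(0.5 * math.log(f2))  # log F(n), with F(n) = sqrt(f2)
    # Step 3: least-squares slope of the log-log fluctuation plot.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Synthetic inputs (made up): independent +1/-1 steps and a strictly alternating series.
rng = random.Random(1)
noise = [rng.choice((-1, 1)) for _ in range(4096)]
alternating = [1 if i % 2 == 0 else -1 for i in range(4096)]

alpha_noise = dfa_alpha(noise)
alpha_alt = dfa_alpha(alternating)
```

Independent data should give an exponent near 0.5, while the strictly alternating (anti-persistent) series gives a value well below that.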
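The two binary classification conventions can be written down as a small mapping. In this sketch the scheme labels `"PP"` and `"WS"`, the function name and the sign convention (purine or strong-bonding base mapped to +1, the other class to −1) are assumptions; only the groupings themselves (purines A and G versus pyrimidines C and T; strong G and C versus weak A and T) are standard.

```python
PURINES = {"A", "G"}   # pyrimidines are C and T
STRONG = {"C", "G"}    # three hydrogen bonds; weak bases are A and T

def binary_code(sequence, scheme="PP"):
    """Map a nucleotide sequence to a list of +1/-1 values.

    scheme="PP": purine -> +1, pyrimidine -> -1
    scheme="WS": strong (C, G) -> +1, weak (A, T) -> -1
    """
    if scheme == "PP":
        positive = PURINES
    elif scheme == "WS":
        positive = STRONG
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    coded = []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError(f"unexpected symbol: {base!r}")
        coded.append(1 if base in positive else -1)
    return coded

pp = binary_code("ACGT", "PP")
ws = binary_code("ACGT", "WS")
print(pp, ws)
```

The same sequence yields different binary series under the two conventions, which is why persistence estimates can depend on the coding scheme.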
We will address these questions by comparing how well rescaled range analysis and detrended fluctuation analysis, based on both a P/P and a W/S binary classification convention, discriminate between coding, regulatory and non-coding, non-regulatory sequences.
The analysis presented in this paper differs from other similar comparative studies in that we explicitly focus on exons and regulatory regions rather than just coding and non-coding regions. Furthermore, the exons and regulatory regions are exclusively experimentally verified sequences and are analysed at the level of individual sequences of a single species (i.e. estimates are not pooled or averaged over species) using rigorous statistics (balanced repeated measurement ANOVA and non-parametric procedures).
Section snippets
Data
The size of the data set is constrained by our requirement to work exclusively with experimentally verified coding- and regulatory enhancer (and not promoter) regions. The sample sizes should be as large as possible but balanced, i.e. the regulatory regions should not be outnumbered by the in principle very large numbers of coding and non-coding, non-regulatory sequences that could be obtained. By thus opting for a data set of high quality rather than quantity, the limiting factor was the
ANOVA analysis of DNA type discrimination
The degree of persistence was estimated for three types of sequences (CODING = coding DNA; REGREG = regulatory regions; NCNREG = non-coding, non-regulatory regions), by two methods (RRA, DFA) using two classification conventions (P/P, W/S). Correspondingly, we have three possible factors affecting the degree of persistence: DNATYPE, METHOD and BINCODE.
We intended to estimate the effects of the three factors on the degree of persistence by means of a multi-variate ANOVA of a between-groups, repeated
Similarities and discrepancies with other studies
In accordance with the observations of Peng et al. (1994) and Buldyrev et al. (1993, 1995), we found a significant difference between different functional parts of DNA, with sequential persistence for coding regions being lower than for non-coding DNA. However, there are also differences, the most striking one being our low values for exons (which actually suggest an anti-persistent sequential structure). Whereas the average Hurst exponent and α for exons and non-conserved,
References (43)
- et al. Correlation approach to identify coding regions in DNA sequences. Biophys. J. (1994)
- et al. Maternal protein deficiency causes hypermethylation of DNA in the livers of rat fetuses. J. Nutr. (2000)
- et al. On the wavelet spectrum diagnostic for Hurst parameter estimation in the analysis of Internet traffic. Comput. Networks (2005)
- et al. Multifractal characterisation of length sequences of coding and noncoding segments in a complete genome. Physica A (2001)
- et al. New methods to infer DNA function from sequence information
- et al. Long-range correlations in genomic DNA: a signature of the nucleosomal structure. Phys. Rev. Lett. (2001)
- et al. Wavelet analysis of DNA bending profiles reveals structural constraints on the evolution of genomic sequences. J. Biol. Phys. (2004)
- Fitting interconnected Markov chain models-DNA sequences and test cricket matches. Statistician (2002)
- Universality in a DNA statistical structure. Phys. Rev. Lett. (1995)
- et al. Decomposition of DNA Sequence Complexity. Phys. Rev. Lett. (1999)
- Long range fractal correlations in DNA. Phys. Rev. Lett.
- Fractals in biology and medicine: from DNA to the heartbeat
- Long-range correlational properties of coding and noncoding DNA sequences: GenBank analysis in DNA. Phys. Rev.
- Study of correlations in segmented DNA sequences: application to structure coupling between exons and introns. J. Theor. Biol.
- Experimental Design ANOVA and Regression
- Correlations in DNA sequences: the role of protein coding segments. Phys. Rev. E
- Regulation of gene expression by dietary fat. Annu. Rev. Nutr.