Orphan and gene related CpG Islands follow power-law-like distributions in several genomes: Evidence of function-related and taxonomy-related modes of distribution

https://doi.org/10.1016/j.compbiolchem.2014.08.013

Abstract

CpG Islands (CGIs) are compositionally defined short genomic stretches, which have been studied in the human, mouse, chicken and later in several other genomes. Initially, they were assigned the role of transcriptional regulation of protein-coding genes, especially the house-keeping ones, while more recently evidence has been found that they are involved in several other functions as well, which might include regulation of the expression of RNA genes, DNA replication etc. Here, an investigation of their distributional characteristics in a variety of genomes is undertaken, both for whole CGI populations and for CGI subsets that lie away from known genes (gene-unrelated or “orphan” CGIs). In both cases power-law-like linearity in double logarithmic scale is found. An evolutionary model, initially put forward for the explanation of a similar pattern found in gene populations, is implemented. It includes segmental duplication events and eliminations of most of the duplicated CGIs, while a moderate rate of non-duplicated CGI eliminations is also applied in some cases. Simulations reproduce all the main features of the observed inter-CGI chromosomal size distributions. Our results on power-law-like linearity found in orphan CGI populations suggest that the observed distributional pattern is independent of the analogous pattern that protein-coding segments were reported to follow. On the basis of the proposed evolutionary model, the power-law-like patterns in the genomic distributions of CGIs described herein are found to be compatible with several other features of the composition, abundance or functional role of CGIs reported in the current literature across several genomes.

Introduction

Genomic CpG islands, in which CpG dinucleotides are abundant and non-methylated, were initially detected experimentally and defined as short (hundreds of nucleotides long) stretches in vertebrate genomes, which are thus offered as cleavable sites for mCpG-sensitive restriction enzymes (HTF islands, see e.g. Bird, 1986). As lengthy DNA sequences and whole genomes became progressively available, sequence-based definitions of CpG Islands (CGIs) and computational algorithms using a sliding window combined with threshold values for key quantities were put forward. Thresholds for sequence-based search of CGIs are considered for: (i) the minimal island length (Lmin); (ii) the percentage of cytosine and guanine content (C + G, abbreviated in the following as CG%); and (iii) the observed-over-expected frequency of occurrence of the CpG dinucleotide (CpGo/e). Gardiner-Garden and Frommer (1987) introduced the first widely used set of thresholds (Lmin = 200 bp, CG% > 50%, CpGo/e > 0.6; to which we will hereafter refer as “relaxed criteria”; bp stands for ‘base pairs’), while later, Takai and Jones (2002) used more conservative threshold values, named hereafter “stringent criteria” (Lmin = 500 bp, CG% > 55%, CpGo/e > 0.65). In the following, these threshold choices will be abbreviated as G-G&F and T&J respectively. Alternatively, methods based on the degree of CpG dinucleotide clustering (Hackenberg et al., 2006, Glass et al., 2007) and on entropic edge detection (Luque-Escamilla et al., 2005) have also been introduced. In order to avoid false positives (e.g. due to high C + G Alu sequences in the human genome), especially when relaxed criteria are used, repeat-masking is usually applied before searching for CGIs.
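The three thresholds listed above can be illustrated with a minimal Python sketch. It is not the CGI-finding software used in this study: the names `Criteria`, `passes` and `candidate_windows` are hypothetical, and the CpGo/e convention used here (number of CpG times window length, divided by the product of the C and G counts) is one common choice that we assume for illustration. Published finders also merge and trim overlapping windows, which is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    min_length: int    # Lmin, in bp
    min_gc: float      # minimal C + G fraction, between 0 and 1
    min_cpg_oe: float  # minimal CpG observed/expected ratio


# Threshold sets quoted in the text.
GGF = Criteria(min_length=200, min_gc=0.50, min_cpg_oe=0.60)  # relaxed
TJ = Criteria(min_length=500, min_gc=0.55, min_cpg_oe=0.65)   # stringent


def gc_fraction(seq: str) -> float:
    """Fraction of C and G nucleotides in seq."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq: str) -> float:
    """CpG o/e, assumed here to be (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def passes(seq: str, crit: Criteria) -> bool:
    """True if seq satisfies all three thresholds of crit."""
    return (len(seq) >= crit.min_length
            and gc_fraction(seq) > crit.min_gc
            and cpg_obs_exp(seq) > crit.min_cpg_oe)


def candidate_windows(seq: str, crit: Criteria, step: int = 1):
    """Yield start offsets of Lmin-long windows that pass crit."""
    w = crit.min_length
    for start in range(0, len(seq) - w + 1, step):
        if passes(seq[start:start + w], crit):
            yield start
```

A 200 bp stretch that satisfies the relaxed criteria can still fail the stringent ones on length alone, which is one reason the two threshold sets yield CGI populations of different sizes.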

CGIs are widely accepted as markers for the existence of protein-coding genes, the promoter regions of which are usually proximal to or overlapping with the islands. Consequently, the search for CGIs was motivated by the need to detect as yet unannotated protein-coding genes. Therefore, most of the introduced CGI-finding algorithms were judged on their ability to find CGIs in the proximity of known genes (see e.g. Bird, 1987, Han and Zhao, 2008, Han and Zhao, 2009).

The functional character of a CGI is related to the condition that its CpG dinucleotides are predominantly unmethylated. Genome-wide methylation data were not available until recently. The CpG depletion of the vertebrate genome is explained on the basis of heavy methylation and the consequent mutation of 5-methylcytosine to thymine, verified by the observed increase in TpG and CpA abundances, and is extensively discussed in relation to CGI evolution and function. See Bird (1986) and Antequera (2003); see also Jabbari and Bernardi (2004), where, however, the opposite position (i.e. the independence of CpG deficiency and TpG, CpA excess from the level of DNA methylation) is held. The propensity of 5-methylcytosine to quickly mutate to thymine has led to the conjecture that methylation is hindered in genomic sequences where CpG frequency is close to the expected one, at least in the germ line. This lack of methylation at specific locations may be seen as the result either of protection from point mutations (mutational “cold spots”) or of purifying selection due to functional roles which are fulfilled only if specific compositional traits are preserved in the underlying sequences. CGIs cannot generally be seen as mutational cold spots, as there are converging data that many CGIs in several lineages have been lost when they ceased to be under purifying selection. The same CGIs are retained over long evolutionary times in other species where they remain functional, e.g. in human and mouse genome comparisons; see Antequera and Bird (1993), Matsuo et al. (1993).

After the systematic search for CGIs in the human and other genomes using the simple sequence criteria described above, researchers attempted to introduce several forms of epigenomic information. Bock et al. (2007) used a method combining a standard sliding window algorithm with available epigenomic data for human chromosomes 21 and 22. Tanay et al. (2007), based on a comparison of the human and chimpanzee genomes, found regions with reduced CpG mutability; these domains largely coincide with Polycomb-binding sites. Later on, Illingworth et al. (2010) provided a comprehensive list of CGIs in the human and mouse genomes based principally on methylation data. The most important of this work’s observations was that there exist numerous “orphan” CpG Islands (i.e. not connected to a known protein-coding gene), independently of the method used for their detection. Illingworth and co-workers formulated a hypothesis according to which most orphan CGIs are related to functional promoters of unknown genes, many of which could be functional RNA genes. More recently, CGIs were shown to often coincide with Origins of Replication (ORIs), suggesting that their functional roles may extend beyond transcriptional regulation (Cayrou et al., 2011). To what extent the colocalization of CGIs and origins of replication is due to a direct causal link or to an indirect effect mediated by positional preferences of gene transcription start sites remains, nonetheless, unclear.

In previous works (Sellis et al., 2007, Sellis and Almirantis, 2009, Klimopoulos et al., 2012) we have observed that distances between transposable elements (TE) belonging to the same family, as well as distances between protein-coding segments (PCS), often follow “power-law-like” size distributions in entire chromosomes. Such distributions present extended linear regions in log–log scale (see Methods). Our principal result herein is that CGIs, whether studied as entire populations or when only orphan islands are considered, also follow power-law-like size distributions. We have put forward an evolutionary model, including biologically plausible steps, for the temporal evolution of genomic components that are subjected to purifying selection. This model, when tested numerically, is shown to systematically generate power-law-like size distributions, like the ones found in the CGI chromosomal distributions.
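To make the kind of model referred to above more concrete, the following toy simulation in Python combines the two ingredients named in the abstract: segmental duplications, and elimination of most of the duplicated markers, with an optional loss of non-duplicated ones. It is only a sketch under our own assumptions and not the implementation used in this study: the function name `simulate`, all parameter names and default values, the uniform choice of segment position and length, and the tandem insertion of the copy directly downstream of the original segment are hypothetical choices made for illustration.

```python
import random


def simulate(n_markers=200, length=1_000_000.0, steps=300,
             max_seg_frac=0.05, keep_prob=0.05, loss_prob=0.0, seed=0):
    """Toy duplication/elimination dynamics for point-like markers (e.g. CGIs).

    Returns (sorted marker positions, final chromosome length).
    All parameters are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    markers = sorted(rng.uniform(0, length) for _ in range(n_markers))
    for _ in range(steps):
        # Pick a random segment to duplicate.
        seg_len = rng.uniform(0, max_seg_frac * length)
        start = rng.uniform(0, length - seg_len)
        end = start + seg_len
        # Markers inside the segment are copied, but most copies are
        # eliminated: each copy survives with probability keep_prob.
        copies = [m + seg_len for m in markers
                  if start <= m < end and rng.random() < keep_prob]
        # Optional moderate loss of pre-existing (non-duplicated) markers.
        if loss_prob > 0:
            markers = [m for m in markers if rng.random() >= loss_prob]
        # Insert the copy right after the segment (assumed tandem insertion):
        # everything downstream of the segment shifts by seg_len.
        markers = [m if m < end else m + seg_len for m in markers]
        markers.extend(copies)
        markers.sort()
        length += seg_len
    return markers, length
```

With a small `keep_prob`, each duplication mostly stretches the spacing between surviving markers instead of adding new ones. The gaps between consecutive entries of the returned list can then be examined in log–log scale.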

Here we investigate the large-scale genomic distribution of CpG Islands in several genomes, after masking for known repeated sequences (for all organisms apart from S. cerevisiae). Power-law-like distributions are extensively observed and through the proposed model we attempt to extend our understanding of the evolution and function of CGIs. In the cases of human and mouse genomes, the data of Illingworth et al. (2010) are also used alongside CGI coordinates derived from the implementation of compositional thresholds. For all studied organisms, besides the “standard” G-G&F and T&J threshold sets, several additional choices of sequence composition are also tested and the resulting inter-CGI distances’ distributions are critically presented. A separate study of populations of orphan CGIs is performed, principally in order to exclude the possibility that the observed power-law-like distributions in whole CGI populations are a mere consequence of similar distributions already known to exist for the inter-genic distances (Sellis and Almirantis, 2009). The persistence of power-law-like chromosomal distributions of orphan CGIs, when these are taken alone, is an indication of their functional role and conservation through purifying selection; it can also provide valuable insight into the overall organization of genome architecture and the mechanisms by which sequences acquire functionality.
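As an illustration of how inter-CGI distance distributions of the kind described above can be inspected, the Python sketch below computes the gaps between consecutive islands on one chromosome, their complementary cumulative distribution, and a least-squares slope in log–log scale. The function names are hypothetical, and measuring each distance from the end of one island to the start of the next is our assumption; the exact definitions used in this study, including the linearity extent, are those given in Methods.

```python
import math


def inter_cgi_distances(islands):
    """Gaps between consecutive islands, given (start, end) pairs.

    Assumes the distance runs from the end of one island to the start
    of the next; overlapping islands contribute no gap.
    """
    ordered = sorted(islands)
    return [nxt[0] - cur[1]
            for cur, nxt in zip(ordered, ordered[1:])
            if nxt[0] > cur[1]]


def ccdf(distances):
    """Return (size, fraction of distances >= size) for each distinct size."""
    xs = sorted(distances)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        if i == 0 or x != xs[i - 1]:
            points.append((x, (n - i) / n))
    return points


def loglog_slope(points):
    """Least-squares slope of log10(fraction) against log10(size)."""
    lx = [math.log10(x) for x, _ in points]
    ly = [math.log10(p) for _, p in points]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx
```

A distribution that is power-law-like appears as an approximately straight line when the `ccdf` points are plotted on double logarithmic axes; fitting the slope only makes sense over the range where that linearity holds.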

Origin of the genomic and epigenomic data

The genomes of the following organisms are used in this study: Apis mellifera, Bos taurus, Caenorhabditis elegans, Canis familiaris, Danio rerio, Drosophila melanogaster, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus, Saccharomyces cerevisiae. Several or all chromosomes of each genome are studied. Only entire chromosomes are considered. Information about the origin of the genomic sequences used may be found in the supplementary file “genome data”. These data are used for the

Chromosomal distributions of CpG-Islands in various genomes

In Fig. 1 we present examples of the complementary cumulative inter-CGI distances’ size distributions, in double logarithmic scale, for whole chromosomes from the 11 considered genomes. The G-G&F or T&J threshold values have been used here. More plots are included in “Supplementary Plots”. In Table 1 the following information is included: (i) mean values of linearity extent E (M.V.) for all studied chromosomes from each organism; (ii) mean values of linearity extent E for the five more

Analysis of the observed power-law-like chromosomal CGI distributions. Taxonomy-related evidence

The property of the formulated simple model to produce power-law-like distributions is an indication that it could be extended to include more detailed modeling of conservation through evolutionary time, in line with the view that functionality is a necessary condition for CpG islands to form stable power-laws. The similarity of the power-law-like pattern emerging in simulations of this model with the genomic distributions of inter-CGI distances cannot be considered, however, as an absolute

Conclusions

In this work we investigated the distributional features of CpG Islands in several genomes by means of a study of inter-CGI distances considering only entire chromosomes. Linearity in log–log scale was often found in the range between two and three orders of magnitude. A simple evolutionary model, formally analogous to one previously proposed by our group for the understanding of genic segment distributions, was formulated and simulations have shown that it might account for the distributional

Acknowledgements

We would like to acknowledge Professors Stavros J. Hamodrakas and George C. Rodakis (both at the Faculty of Biology, University of Athens) for serving as academic advisors to G.T., and Philippa Constantinou for having participated in an initial stage of this study. We are particularly indebted to Professor Oliver Clay who predicted the particularly high power-law-linearity of yeast chromosome III on the basis of the findings of Bradnam et al. (1999) about the compositional mosaicism of this

References (47)

  • F. Antequera. Structure, function and evolution of CpG island promoters. Cell. Mol. Life Sci. (2003)
  • L. Athanasopoulou et al. Scaling properties and fractality in the distribution of coding segments in eukaryotic genomes revealed through a block entropy approach. Phys. Rev. E (2010)
  • J.A. Bailey et al. Recent segmental duplications in the human genome. Science (2002)
  • A.P. Bird. CpG-rich islands and the function of DNA methylation. Nature (1986)
  • C. Bock et al. CpG island mapping by epigenome prediction. PLoS Comp. Biol. (2007)
  • K.R. Bradnam et al. G + C content variation along and among Saccharomyces cerevisiae chromosomes. Mol. Biol. Evol. (1999)
  • C. Cayrou et al. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features. Genome Res. (2011)
  • A. Clauset, C.R. Shalizi, M.E.J. Newman. Power-law distributions in empirical data. arXiv. 0706.1062v1... (2007)
  • T.J. Gibson et al. Evidence in favour of ancient octaploidy in the vertebrate genome. Biochem. Soc. Trans. (2000)
  • J.L. Glass et al. CG dinucleotide clustering is a species-specific property of the genome. Nucl. Acids Res. (2007)
  • M. Hackenberg et al. CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinformatics (2006)
  • L. Han et al. CpG island density and its correlations with genomic features in mammalian genomes. Genome Biol. (2008)
  • L. Han et al. Comparative analysis of CpG islands in four fish genomes. Comp. Funct. Genomics (2008)