Elsevier

Biosystems

Volume 101, Issue 2, August 2010, Pages 109-116

A sequence-based analysis of the pointer distribution of stichotrichous ciliates

https://doi.org/10.1016/j.biosystems.2010.05.003

Abstract

Micronuclear genes in stichotrichous ciliates are broken into blocks separated by noncoding sequences, sometimes with the blocks in a shuffled order, some even inverted. During reproduction, all blocks are assembled in the correct order and orientation. This process is possible due to the special structure of micronuclear genes: each coding block M ends with a short nucleotide sequence (called pointer) that is repeated at the beginning of the coding block that should follow M in the assembled gene. Many of the pointers have multiple occurrences along both strands of the gene. This yields a very high number of pointer-induced possible divisions into coding and noncoding blocks.

We investigate the distribution of pointers for all currently sequenced micronuclear ciliate genes with the goal of identifying what distinguishes the real gene structure among all possible coding/noncoding divisions. We find a sharp criterion in the average a/t-content of the noncoding blocks: the real division has, in most cases, the maximum such content among all possible combinations. Even for pointers as short as two nucleotides, the real division is one of very few with an average a/t-content of its noncoding blocks over 80%. The separation is most clear when the loci of pointers of up to four nucleotides (even three in the case of unscrambled genes) are fixed (e.g., through a template-based recombination mechanism).

Introduction

Ciliates are unicellular eukaryotes forming an old and diverse group. They have two types of nuclei: a somatic one, called macronucleus (MAC), and a germline one, called micronucleus (MIC). Following conjugation, a mitotic copy of the micronucleus develops into a new macronucleus, while the old macronuclei are destroyed. This process involves massive DNA manipulations, including sequence eliminations and rearrangements (inversions and translocations). The process is especially pronounced in a clade of ciliates called Stichotrichs, on which we focus in the paper. This DNA processing is called for by the drastically different genomic organizations in MIC and MAC. A macronuclear gene is a sequence of basepairs, very often placed on its own DNA molecule. The same gene in the micronucleus is placed on long chromosomes and broken into blocks (called macronuclear destined sequences, or MDSs), separated by noncoding blocks (called internally eliminated sequences, or IESs). Moreover, in some cases, the MDSs are presented in a scrambled order, some of them even being inverted, see Fig. 1. We refer to Prescott (1994) for a survey on the topic.

A clue about the mechanism for assembling the MDSs in the orthodox order is given by their structure. It turns out that each MDS ends with a nucleotide sequence that is repeated in the beginning of the MDS that should follow it in the macronuclear gene, see Fig. 2 and DuBois and Prescott (1997). These sequences are called pointers. In the following, we denote by $P_i$ the pointer that the $i$th MDS of the macronuclear gene, say $M_i$, starts with, for all $i \geq 1$. MDS $M_i$ ends with the pointer $P_{i+1}$, which also has an occurrence in the beginning of MDS $M_{i+1}$. The sequences in the beginning of the first MDS and at the end of the last MDS are called (beginning and ending, respectively) markers. The division of the micronuclear gene into MDSs and IESs is uniquely determined by the locations of all pointers and markers. E.g., the first MDS starts with the beginning marker and stretches over to the pointer adjacent to it. The strand where the beginning marker lies gives the strand of the whole MDS and, in particular, that of the first pointer. The second occurrence of that pointer identifies the beginning of the second MDS, as well as its ending pointer, etc. The last MDS ends with the ending marker of the gene.
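The pointer structure described above can be illustrated with a minimal Python sketch. It assumes the MDSs are already given in orthodox order and orientation as plain strings, and that each consecutive pair shares one pointer; the function name `assemble` and the toy sequences are hypothetical and are not taken from the paper.

```python
def assemble(mdss, pointers):
    """Splice MDSs given in orthodox order, keeping each shared pointer once.

    mdss[i] is expected to end with pointers[i], and mdss[i + 1] to start
    with it (illustrative convention, 0-based).
    """
    if len(pointers) != len(mdss) - 1:
        raise ValueError("need exactly one pointer per pair of consecutive MDSs")
    gene = mdss[0]
    for next_mds, pointer in zip(mdss[1:], pointers):
        if not gene.endswith(pointer) or not next_mds.startswith(pointer):
            raise ValueError(f"pointer {pointer!r} does not join the two MDSs")
        # Drop the duplicated pointer copy at the start of the next MDS.
        gene += next_mds[len(pointer):]
    return gene


# Toy example: three MDSs joined on the pointers "CTA" and "AT".
toy_gene = assemble(["ATGCCTA", "CTAGGAT", "ATTTGA"], ["CTA", "AT"])
```

The sketch only covers the final splicing step; locating the pointers in the micronuclear sequence, which is the hard part discussed in this paper, is not addressed here.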

During gene assembly, the IESs are excised while the MDSs are spliced together on their common pointers to yield the assembled macronuclear gene. It is well understood by now (see Prescott et al., 2003; Angeleska et al., 2007; Nowacki et al., 2008) that many of the pointers are too short to guarantee unique identification. Indeed, as we also show in this paper, many pointers have multiple occurrences along the micronuclear chromosome. An additional mechanism for the unambiguous identification of all pointers has been proposed in Prescott et al. (2003) and in Angeleska et al. (2007) and experimentally demonstrated in Nowacki et al. (2008). The idea is that ciliates would be able to use a macronucleus-originated DNA (as in Prescott et al., 2003) or RNA (as in Angeleska et al., 2007) template, allowing for unambiguous DNA recognition based on whole MDSs rather than just on (short) pointers. For the kinetic details of the proposed template-based DNA recombination mechanisms we refer to Prescott et al. (2003) and Angeleska et al. (2007). We also refer to Prescott et al. (2007) for an evolutionary hypothesis for the structure of micronuclear genes.

The motivation of our study is to investigate the possible evolutionary pressure towards selecting such a complex mechanism for gene assembly in ciliates as the template-based DNA recombination. We assume a scenario where ciliate genes are assembled in the absence of the template-based mechanism. We show that the combinatorial complexity of the process in this case is huge, and thus unlikely to succeed in a cellular environment. The combinatorial complexity of the problem comes from only a few (short) pointers having a high number of occurrences in the MIC gene. When the positions of these pointers (of length up to 4 bp, and even 3 bp for unscrambled genes) are fixed on their real loci (e.g., through a template-based recombination mechanism), the complexity of the problem is drastically reduced. For this reduced problem, we describe in this paper a clustering principle that allows for a clear separation between the real pointer distribution and all the other combinations of pointer occurrences. We describe our approach in the following.

We consider the problem of identifying the correct occurrences of pointers, and so the correct division into MDSs/IESs, when only the nucleotide sequences of all pointers are known. We consider all currently sequenced micronuclear ciliate genes, see Cavalcanti et al. (2005), and their pointers. We identify all occurrences of each pointer sequence along the micronuclear gene. We consider the occurrences of each pointer on both strands of the gene. For each pointer we consider all possible pairs of non-overlapping occurrences. We repeat the process for all pointers and consider all combinations of pairs of non-overlapping pointer loci. For example, consider a gene with two pointers $P$ and $Q$, the first having three loci $P_1, P_2, P_3$ and the second having four loci $Q_1, Q_2, Q_3, Q_4$. If none of the seven loci overlaps with another one, then we consider all combinations of pointer loci $\{P_i, P_j, Q_k, Q_l\}$, where $1 \leq i < j \leq 3$ and $1 \leq k < l \leq 4$. If, for example, locus $P_1$ overlaps with locus $Q_2$, then we exclude all combinations where both $P_1$ and $Q_2$ are selected. We add to each combination of pointer loci the loci of the two markers of the gene and exclude any combinations where the marker loci overlap with any pointer locus.
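The enumeration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: loci are represented as hypothetical `(start, end, strand)` tuples in forward-strand coordinates with an exclusive end, two loci are taken to overlap whenever their coordinate ranges intersect regardless of strand, and the function names are invented for this example.

```python
from itertools import combinations, product

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def occurrences(gene, pointer):
    """All loci (start, end, strand) of a pointer on both strands of the gene."""
    loci = []
    for strand, pattern in (("+", pointer), ("-", reverse_complement(pointer))):
        start = gene.find(pattern)
        while start != -1:
            loci.append((start, start + len(pattern), strand))
            start = gene.find(pattern, start + 1)  # also catches shifted, overlapping hits
    return sorted(loci)


def overlaps(a, b):
    """True if two loci share at least one position (strand is ignored: assumption)."""
    return a[0] < b[1] and b[0] < a[1]


def locus_combinations(gene, pointers, marker_loci=()):
    """Yield one pair of loci per pointer such that no selected loci overlap."""
    pair_lists = []
    for pointer in pointers:
        occ = occurrences(gene, pointer)
        pair_lists.append([p for p in combinations(occ, 2) if not overlaps(*p)])
    for choice in product(*pair_lists):
        loci = [locus for pair in choice for locus in pair] + list(marker_loci)
        if all(not overlaps(a, b) for a, b in combinations(loci, 2)):
            yield choice


# Toy gene with three loci of "AAC" and four loci of "CCA", none overlapping.
toy = "AACTCCATAACTCCATAACTCCATCCAT"
n_combinations = sum(1 for _ in locus_combinations(toy, ["AAC", "CCA"]))
```

For the toy gene, which mirrors the two-pointer example in the text, the sketch yields 3 × 6 = 18 combinations. The number of combinations is the product of the per-pointer pair counts before overlap filtering, which is why short pointers with many occurrences dominate the search space.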

For some of these combinations it is possible to divide the gene into blocks that are similar to the MDSs and the IESs: the first block starts with the beginning marker and stretches over to the adjacent pointer, being placed on the strand of the beginning marker, the second block starts with the second occurrence of that pointer, etc. The last block should end with the ending marker. Somewhat abusing the terminology, we call all such blocks MDSs and the remaining blocks of the gene IESs. Clearly, the construction may fail for some combinations of pointer and marker loci. Also, for some of the decompositions the resulting MDSs may be assembled without losing any MDS, while for others they may not. We call any such decomposition valid. We introduce them formally in Section 2.

It turns out that in many cases, the number of the valid decompositions is huge. The real MDS decomposition (leading to the macronuclear gene) of the micronuclear gene is one of them, see Section 3 for examples. We discuss what distinguishes this real decomposition among all the other alternatives. We only follow one simple criterion in our comparison of all possible combinations: the sum of the a/t-content of all the induced IESs. The results in all case studies are remarkable: even for pointers as short as two nucleotides, the real assembly is one of very few assemblies with an average a/t-content per IES over 80%. This separation is most evident when the shortest pointers (having most occurrences along the chromosome) are fixed on their real positions and only combinations of longer pointers are investigated. Our examples suggest that, as long as the real loci of pointers with at most four nucleotides (or even three for unscrambled genes) are known, the real assembly has the maximum a/t-content per IES of all possible MDS assemblies.
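The a/t-content criterion can be sketched in Python as below. The exact scoring formula is not given in this excerpt, so the sketch assumes the score is the unweighted mean, over all induced IESs, of the fraction of A and T nucleotides in each IES; the function names and the `(start, end)` interval representation (exclusive end) are hypothetical.

```python
def at_content(seq):
    """Fraction of A/T nucleotides in a non-empty DNA string."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def mean_ies_at_content(gene, ies_intervals):
    """Average a/t-content per IES (assumed scoring: unweighted mean).

    ies_intervals is a list of (start, end) pairs, end exclusive, marking
    the IESs induced by one candidate decomposition of the gene.
    """
    if not ies_intervals:
        raise ValueError("decomposition induces no IES")
    fractions = [at_content(gene[start:end]) for start, end in ies_intervals]
    return sum(fractions) / len(fractions)


# Toy example: two IESs, "ATAT" and "ATGC".
score = mean_ies_at_content("ATATGGATGC", [(0, 4), (6, 10)])
```

Under this assumption, candidate decompositions could be ranked by `mean_ies_at_content`, with the expectation stated in the text that the real decomposition scores at or near the top.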

Mathematical Preliminaries

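Consider an alphabet $\mathcal{M}_n=\{M_1,M_2,\ldots,M_n\}$, for some $n\geq 1$, and its disjoint copy $\overline{\mathcal{M}}_n$: $\overline{\mathcal{M}}_n=\{\overline{M}_1,\overline{M}_2,\ldots,\overline{M}_n\}$. For each $M\in\mathcal{M}_n$, we set $\overline{\overline{M}}=M$. Also, we denote $\|\overline{M}\|=\|M\|=M$, for all $M\in\mathcal{M}_n$. We denote by $(\mathcal{M}_n\cup\overline{\mathcal{M}}_n)^*$ the set of all strings over the alphabet $\mathcal{M}_n\cup\overline{\mathcal{M}}_n$. Any mapping $\rho:\mathcal{M}_n\to(\mathcal{M}_n\cup\overline{\mathcal{M}}_n)^*$ can be extended to a string morphism $\rho:(\mathcal{M}_n\cup\overline{\mathcal{M}}_n)^*\to(\mathcal{M}_n\cup\overline{\mathcal{M}}_n)^*$ by setting $\rho(\overline{a})=\overline{\rho(a)}$ and $\rho(a_1a_2\ldots a_l)=\rho(a_1)\rho(a_2)\ldots\rho(a_l)$, for all $a,a_1,\ldots,a_l\in\mathcal{M}_n\cup\overline{\mathcal{M}}_n$.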

A string π over the alphabet Mn is called a permutation if each letter MMn has exactly one

Approach

We consider all ciliate genes from the database (Cavalcanti et al., 2005) that can be found at http://oxytricha.princeton.edu/dimorphism/database.htm. We take all genes for which the nucleotide sequences of all pointers are known. We include also the DNA Polymerase Alpha gene in Paraurostyla weissei, for which only the last pointer is not known. In this case we have replaced the unknown pointer by the end marker. The list of genes considered in our study is summarized in Table 1. In this list,

Methods

For all genes in Table 1, we use the following algorithmic procedure:

  1. Consider the nucleotide sequences of all the pointers of the gene. For every pointer sequence, find all its occurrences on both strands of the gene.

  2. Consider all possible combinations of non-overlapping pointer occurrences having exactly two occurrences of each pointer. Select only those strings that are realistic (we recall that these are the strings that lead to a valid decomposition into MDSs and IESs, modulo relabeling). If

Results

In this section we present some of the other results. Due to space restrictions, we skip the results of the analysis for some of the genes with few pointers. For full results we refer to Verlan et al. (2008).

Conclusion

A template-based recombination mechanism has been proposed in Prescott et al. (2003) and Angeleska et al. (2007) to account for the problem of unambiguous pointer identification in ciliate genes. This mechanism is very complex, involving two recombining molecules (or two parts of the same molecule), as well as a template molecule guiding the recombination. We investigated in this paper the objective needs for such a complex mechanism. We focused on the difficulty of the pointer identification

Acknowledgments

This work was supported by the Science and Technology Center in Ukraine [4032 to SV and AA], by the Japan Society for Promotion of Science and Grant-in-Aid [2008364 to AA], and by the Academy of Finland [108421, 203647 to IP].
