A sequence-based analysis of the pointer distribution of stichotrichous ciliates
Introduction
Ciliates are unicellular eukaryotes forming an old and diverse group. They have two types of nuclei: a somatic one, called the macronucleus (MAC), and a germline one, called the micronucleus (MIC). Following conjugation, a mitotic copy of the micronucleus develops into a new macronucleus, while the old macronuclei are destroyed. This process involves massive DNA manipulations, including sequence eliminations and rearrangements (inversions and translocations). The process is especially pronounced in a clade of ciliates called stichotrichs, on which we focus in this paper. This DNA processing is called for by the drastically different genomic organizations of the MIC and the MAC. A macronuclear gene is a sequence of base pairs, very often placed on its own DNA molecule. The same gene in the micronucleus is placed on long chromosomes and broken into blocks (called macronuclear destined sequences, or MDSs), separated by noncoding blocks (called internally eliminated sequences, or IESs). Moreover, in some cases, the MDSs are presented in a scrambled order, some of them even being inverted, see Fig. 1. We refer to Prescott (1994) for a survey on the topic.
A clue about the mechanism for assembling the MDSs in the orthodox order is given by their structure. It turns out that each MDS ends with a nucleotide sequence that is repeated at the beginning of the MDS that should follow it in the macronuclear gene, see Fig. 2 and DuBois and Prescott (1997). These sequences are called pointers. In the following, we denote by $p_i$ the pointer that the $i$th MDS of the macronuclear gene, say $M_i$, starts with, for all $i \geq 2$. MDS $M_{i-1}$ ends with the pointer $p_i$, which also has an occurrence at the beginning of MDS $M_i$. The sequences at the beginning of the first MDS and at the end of the last MDS are called (beginning and ending, respectively) markers. The division of the micronuclear gene into MDSs and IESs is uniquely determined by the locations of all pointers and markers. For example, the first MDS starts with the beginning marker and stretches over to the pointer adjacent to it. The strand where the beginning marker lies gives the strand of the whole MDS and, in particular, that of the first pointer. The second occurrence of that pointer identifies the beginning of the second MDS, as well as its ending pointer, and so on. The last MDS ends with the ending marker of the gene.
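The pointer structure described above can be illustrated with a minimal sketch. It assumes the MDSs are already given in macronuclear order and on the same strand, and it joins them by keeping a single copy of each shared pointer. The function name `splice_on_pointers` and the toy sequences are hypothetical and are not taken from any real ciliate gene.

```python
def splice_on_pointers(mdss, pointers):
    """Join MDSs given in macronuclear order, keeping one copy of each pointer.

    mdss     -- list of MDS strings in macronuclear order (toy data)
    pointers -- list of pointer strings, one per junction; pointers[i] must
                end mdss[i] and begin mdss[i + 1]
    """
    if len(pointers) != len(mdss) - 1:
        raise ValueError("expected exactly one pointer per MDS junction")
    assembled = mdss[0]
    for pointer, next_mds in zip(pointers, mdss[1:]):
        if not assembled.endswith(pointer) or not next_mds.startswith(pointer):
            raise ValueError(f"pointer {pointer!r} does not flank the junction")
        # Drop the duplicated copy of the pointer from the incoming MDS.
        assembled += next_mds[len(pointer):]
    return assembled


# Hypothetical three-MDS gene with pointers "TAC" and "CGT".
toy_mdss = ["ATGGCTTAC", "TACGGAACGT", "CGTTTGA"]
toy_pointers = ["TAC", "CGT"]
print(splice_on_pointers(toy_mdss, toy_pointers))
```

The sketch only covers the final joining step; it does not model scrambled or inverted MDSs, which would first have to be reordered and reverse-complemented.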
During gene assembly, the IESs are excised while the MDSs are spliced together on their common pointers to yield the assembled macronuclear gene. It is well understood by now (see Prescott et al., 2003; Angeleska et al., 2007; Nowacki et al., 2008) that many of the pointers are too short to guarantee unique identification. Indeed, as we also show in this paper, many pointers have multiple occurrences along the micronuclear chromosome. An additional mechanism for the unambiguous identification of all pointers has been proposed in Prescott et al. (2003) and in Angeleska et al. (2007) and experimentally demonstrated in Nowacki et al. (2008). The idea is that ciliates would be able to use a macronucleus-originated DNA template (as in Prescott et al., 2003) or RNA template (as in Angeleska et al., 2007), allowing for unambiguous DNA recognition based on whole MDSs rather than just on (short) pointers. For the kinetic details of the proposed template-based DNA recombination mechanisms we refer to Prescott et al. (2003) and Angeleska et al. (2007). We also refer to Prescott et al. (2007) for an evolutionary hypothesis for the structure of micronuclear genes.
The motivation of our study is to investigate the possible evolutionary pressure towards selecting a mechanism for gene assembly in ciliates as complex as template-based DNA recombination. We assume a scenario where ciliate genes are assembled in the absence of the template-based mechanism. We show that the combinatorial complexity of the process in this case is huge, and that the process is thus unlikely to succeed in a cellular environment. The combinatorial complexity of the problem comes from only a few (short) pointers having a high number of occurrences in the MIC gene. When the positions of these pointers (of length up to 4 bp, and even 3 bp for unscrambled genes) are fixed at their real loci (e.g., through a template-based recombination mechanism), the complexity of the problem is drastically reduced. For this reduced problem, we describe in this paper a clustering principle that allows for a clear separation between the real pointer distribution and all the other combinations of pointer occurrences. We describe our approach in the following.
We consider the problem of identifying the correct occurrences of pointers, and so the correct division into MDSs/IESs, when only the nucleotide sequences of all pointers are known. We consider all currently sequenced micronuclear ciliate genes, see Cavalcanti et al. (2005), and their pointers. We identify all occurrences of each pointer sequence along the micronuclear gene. We consider the occurrences of each pointer on both strands of the gene. For each pointer we consider all possible pairs of non-overlapping occurrences. We repeat the process for all pointers and consider all combinations of pairs of non-overlapping pointer loci. For example, consider a gene that has two pointers $p$ and $q$, the first one having three loci $p_1$, $p_2$, $p_3$ and the second having four, $q_1$, $q_2$, $q_3$, $q_4$. If none of the seven loci overlaps with another one, then we consider all combinations of pointer loci $(p_i, p_j, q_k, q_l)$, where $1 \leq i < j \leq 3$ and $1 \leq k < l \leq 4$. If, for example, locus $p_1$ overlaps with locus $q_1$, then we exclude all combinations where both $p_1$ and $q_1$ are selected. We add to each combination of pointer loci the loci of the two markers of the gene and exclude any combinations where the marker loci overlap with any pointer locus.
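The enumeration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: loci are represented as 0-based half-open `(start, end, strand)` triples on the given strand, an occurrence on the opposite strand is detected by searching for the reverse complement, and the marker loci are left out for brevity. All function names and the toy gene are hypothetical.

```python
from itertools import combinations, product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def find_loci(gene, pointer):
    """All occurrences of `pointer` on both strands as (start, end, strand)."""
    loci = set()
    for strand, pattern in (("+", pointer), ("-", reverse_complement(pointer))):
        start = gene.find(pattern)
        while start != -1:
            loci.add((start, start + len(pattern), strand))
            # Step by one so that self-overlapping occurrences are also found.
            start = gene.find(pattern, start + 1)
    return sorted(loci)


def overlaps(a, b):
    """True if two loci share at least one position, regardless of strand."""
    return a[0] < b[1] and b[0] < a[1]


def locus_pairs(loci):
    """All pairs of non-overlapping loci of a single pointer."""
    return [(a, b) for a, b in combinations(loci, 2) if not overlaps(a, b)]


def pointer_combinations(gene, pointers):
    """Yield one locus pair per pointer such that no two chosen loci overlap."""
    per_pointer = [locus_pairs(find_loci(gene, p)) for p in pointers]
    for choice in product(*per_pointer):
        chosen = [locus for pair in choice for locus in pair]
        if all(not overlaps(a, b) for a, b in combinations(chosen, 2)):
            yield choice


# Toy gene (hypothetical): "GATT" has three loci and "CC" has two.
toy_gene = "GATTCCGATTAACCGATT"
for combo in pointer_combinations(toy_gene, ["GATT", "CC"]):
    print(combo)
```

Because the number of combinations is a product over all pointers, a generator is used here so that combinations can be filtered one at a time rather than stored.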
For some of these combinations it is possible to divide the gene into blocks that are similar to the MDSs and the IESs: the first block starts with the beginning marker and stretches over to the adjacent pointer, being placed on the strand of the beginning marker; the second block starts with the second occurrence of that pointer, and so on. The last block should end with the ending marker. Somewhat abusing the terminology, we call all such blocks MDSs and the remaining blocks of the gene IESs. Clearly, the construction may fail for some combinations of pointer and marker loci. Also, for some of the decompositions the resulting MDSs may be assembled without losing any MDS, while for others they may not. We call a decomposition valid if its MDSs can be assembled without losing any of them. We introduce valid decompositions formally in Section 2.
It turns out that in many cases the number of valid decompositions is huge. The real MDS decomposition of the micronuclear gene (the one leading to the macronuclear gene) is one of them, see Section 3 for examples. We discuss what distinguishes this real decomposition from all the other alternatives. We follow only one simple criterion in our comparison of all possible combinations: the sum of the a/t-content of all the induced IESs. The results in all case studies are remarkable: even for pointers as short as two nucleotides, the real assembly is one of very few assemblies with an average a/t-content per IES over 80%. This separation is most evident when the shortest pointers (those having the most occurrences along the chromosome) are fixed at their real positions and only combinations of longer pointers are investigated. Our examples suggest that, as long as the real loci of pointers with at most four nucleotides (or even three for unscrambled genes) are known, the real assembly has the maximum a/t-content per IES of all possible MDS assemblies.
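The a/t-content criterion can be illustrated with a short sketch. It assumes a decomposition is given as a list of 0-based half-open MDS intervals along the micronuclear gene, and it treats every gap between consecutive MDS intervals as an IES; how the flanking pointer copies are attributed to MDSs or IESs is an assumption of this sketch, as are all function names and the toy data.

```python
def at_content(seq):
    """Fraction of A and T nucleotides in a sequence (0.0 for an empty one)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def ies_intervals(mds_intervals):
    """Gaps between consecutive MDS intervals, taken as the induced IESs."""
    ordered = sorted(mds_intervals)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]


def mean_ies_at_content(gene, mds_intervals):
    """Average a/t-content per IES for one candidate decomposition."""
    gaps = ies_intervals(mds_intervals)
    if not gaps:
        return 0.0
    return sum(at_content(gene[start:end]) for start, end in gaps) / len(gaps)


# Toy gene (hypothetical): three GC-rich MDSs separated by two AT-rich IESs.
toy_gene = "GCGC" + "ATTA" + "CCGG" + "TATG" + "GGCC"
toy_mdss = [(0, 4), (8, 12), (16, 20)]
print(mean_ies_at_content(toy_gene, toy_mdss))
```

Candidate decompositions could then be ranked by this score, with the 80% threshold mentioned above applied as a simple filter.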
Mathematical Preliminaries
Consider an alphabet , for some and its disjoint copy : . For each , we set . Also, we denote , for all . We denote the set of all strings over alphabet . Any mapping can be extended to a string morphism by setting and , for all .
A string over the alphabet is called a permutation if each letter has exactly one
Approach
We consider all ciliate genes from the database of Cavalcanti et al. (2005), which can be found at http://oxytricha.princeton.edu/dimorphism/database.htm. We take all genes for which the nucleotide sequences of all pointers are known. We also include the DNA Polymerase Alpha gene in Paraurostyla weissei, for which only the last pointer is not known. In this case we have replaced the unknown pointer by the end marker. The list of genes considered in our study is summarized in Table 1. In this list,
Methods
For all genes in Table 1, we use the following algorithmic procedure:
1. Consider the nucleotide sequences of all the pointers of the gene. For every pointer sequence, find all its occurrences on both strands of the gene.
2. Consider all possible combinations of non-overlapping pointer occurrences having exactly two occurrences of each pointer. Select only those strings that are realistic (we recall that these are the strings that lead to a valid decomposition into MDSs and IESs, modulo relabeling). If
Results
In this section we present some of the other results. Due to space restrictions, we skip the results of the analysis for some of the genes with few pointers. For full results we refer to Verlan et al. (2008).
Conclusion
A template-based recombination mechanism has been proposed in Prescott et al. (2003) and Angeleska et al. (2007) to account for the problem of unambiguous pointer identification in ciliate genes. This mechanism is very complex, involving two recombining molecules (or two parts of the same molecule), as well as a template molecule guiding the recombination. We investigated in this paper the objective needs for such a complex mechanism. We focused on the difficulty of the pointer identification
Acknowledgments
This work was supported by the Science and Technology Center in Ukraine [4032 to SV and AA], by the Japan Society for Promotion of Science and Grant-in-Aid [2008364 to AA], and by the Academy of Finland [108421, 203647 to IP].
References
- Angeleska et al., RNA-template guided DNA assembly, J. Theor. Biol. (2007)
- Prescott et al., Template-guided recombination for IES elimination and unscrambling of genes in stichotrichous ciliates, J. Theor. Biol. (2003)
- Cavalcanti et al., MDS_IES_DB: a database of macronuclear and micronuclear genes in spirotrichous ciliates, Nucleic Acids Res. (2005)
- Cavalcanti et al., Gene unscrambler for detangling scrambled genes in ciliates, Bioinformatics (2004)
- DuBois and Prescott, Volatility of internal eliminated segments in germ line genes of hypotrichous ciliates, Mol. Cell. Biol. (1997)