Abstract
In many applications of tandem repeats, the outcome depends critically on the choice of boundaries (beginning and end) of the repeated motif: for example, different choices of pattern boundaries can lead to different duplication history trees. However, the best choice of boundaries, or parsing, of the tandem repeat is often ambiguous, because the flanking regions before and after the tandem repeat frequently contain partial approximate copies of the motif, making it difficult to determine where the tandem repeat (and hence the motif) begins and ends. We define the parsing problem for tandem repeats to be the problem of discriminating among the possible choices of parsing.
In this paper we propose and compare three heuristic methods for solving the parsing problem, under the assumption that the parsing is fixed throughout the duplication history of the tandem repeat. The three methods are PAIR, which minimises the number of pairs of common adjacent mutations that span a boundary; VAR, which minimises the total number of variants of the motif; and MST, which minimises the length of the minimum spanning tree connecting the variants, where the weight of each edge is the Hamming distance between the pair of variants. We test the methods on simulated data over a range of motif lengths and relative rates of substitutions to duplications, and show that all three perform better than choosing the parsing arbitrarily. Of the three, MST typically performs best, followed by VAR and then PAIR.
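The VAR and MST criteria can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`parse`, `var_score`, `mst_score`, `best_offsets`), the treatment of a parsing as a start offset into a string with a known motif length, and the choice to score the same number of copies at every offset are all assumptions made for illustration. PAIR is omitted because the abstract does not define it in enough detail to reproduce.

```python
def parse(seq, motif_len, offset):
    """Cut seq into consecutive copies of length motif_len, starting at offset.

    Assumption: every offset in range(motif_len) yields the same number of
    copies, so that scores are comparable across parsings.
    """
    n_copies = (len(seq) - motif_len + 1) // motif_len
    return [seq[offset + i * motif_len: offset + (i + 1) * motif_len]
            for i in range(n_copies)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def var_score(copies):
    """VAR criterion: the number of distinct motif variants."""
    return len(set(copies))


def mst_score(copies):
    """MST criterion: total weight of a minimum spanning tree over the
    distinct variants, with Hamming distance as the edge weight (Prim)."""
    variants = sorted(set(copies))
    if len(variants) < 2:
        return 0
    # best[v] is the cheapest known edge from v to the growing tree.
    best = {v: hamming(variants[0], v) for v in variants[1:]}
    total = 0
    while best:
        v = min(best, key=best.get)
        total += best.pop(v)
        for u in best:
            best[u] = min(best[u], hamming(v, u))
    return total


def best_offsets(seq, motif_len, score):
    """Score every candidate offset; return the minimisers and all scores."""
    scores = {k: score(parse(seq, motif_len, k)) for k in range(motif_len)}
    low = min(scores.values())
    return [k for k, s in scores.items() if s == low], scores
```

Calling `best_offsets(seq, motif_len, mst_score)` returns every offset that attains the minimum, so ties between parsings stay visible rather than being broken arbitrarily.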
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Matroud, A.A., Tuffley, C.P., Bryant, D., Hendy, M.D. (2012). A Comparison of Three Heuristic Methods for Solving the Parsing Problem for Tandem Repeats. In: de Souto, M.C., Kann, M.G. (eds) Advances in Bioinformatics and Computational Biology. BSB 2012. Lecture Notes in Computer Science, vol 7409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31927-3_4
DOI: https://doi.org/10.1007/978-3-642-31927-3_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31926-6
Online ISBN: 978-3-642-31927-3
eBook Packages: Computer Science (R0)