Abstract
We describe an efficient local multiple alignment filtration heuristic for identifying conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low-multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as \(w\) characters in any occurrence of the motif. The algorithm consumes \(\mathcal{O}(wN)\) memory and \(\mathcal{O}(wN \log wN)\) time, where \(N\) is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat-rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from http://gel.ahabs.wisc.edu/procrastination.
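To illustrate why a palindromic seed pattern can match both strands at once, the following is a minimal Python sketch, not the procrastAligner implementation. It assumes a seed is written as a string of `1` (position must match) and `0` (position is ignored). The pattern `1101011` and all function names are hypothetical choices made for this example. Because the pattern reads the same in both directions, masking a window and then reverse-complementing gives the same word as reverse-complementing the window and then masking, so a single canonical key (here, the lexicographically smaller of the two words) indexes an occurrence regardless of its strand.

```python
from typing import Dict, List

# Hypothetical helper names; the seed pattern is illustrative only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(pattern: str) -> bool:
    """A seed pattern is palindromic if it equals its own reversal."""
    return pattern == pattern[::-1]


def masked_word(window: str, pattern: str) -> str:
    """Keep only the bases at positions where the pattern has a '1'."""
    return "".join(base for base, p in zip(window, pattern) if p == "1")


def canonical_key(window: str, pattern: str) -> str:
    """Strand-independent key: the smaller of the forward masked word
    and the masked word of the reverse-complemented window."""
    forward = masked_word(window, pattern)
    reverse = masked_word(reverse_complement(window), pattern)
    return min(forward, reverse)


def seed_matches(seq: str, pattern: str) -> Dict[str, List[int]]:
    """Map each canonical key to the start positions where it occurs."""
    if not is_palindromic(pattern):
        raise ValueError("this sketch assumes a palindromic seed pattern")
    index: Dict[str, List[int]] = {}
    width = len(pattern)
    for start in range(len(seq) - width + 1):
        key = canonical_key(seq[start:start + width], pattern)
        index.setdefault(key, []).append(start)
    return index


if __name__ == "__main__":
    pattern = "1101011"  # assumed example pattern, reads the same reversed
    # A 7-mer, a two-base spacer, then the reverse complement of the 7-mer.
    sequence = "ACGTTGA" + "CC" + reverse_complement("ACGTTGA")
    for key, positions in sorted(seed_matches(sequence, pattern).items()):
        if len(positions) > 1:
            print(key, positions)
```

In this sketch, keys that occur at more than one position play the role of seed matches, and the length of each position list corresponds to the match multiplicity that the abstract refers to when ordering seed extension.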
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Darling, A.E., Treangen, T.J., Zhang, L., Kuiken, C., Messeguer, X., Perna, N.T. (2006). Procrastination Leads to Efficient Filtration for Local Multiple Alignment. In: Bücher, P., Moret, B.M.E. (eds) Algorithms in Bioinformatics. WABI 2006. Lecture Notes in Computer Science, vol 4175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11851561_12
DOI: https://doi.org/10.1007/11851561_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39583-6
Online ISBN: 978-3-540-39584-3
eBook Packages: Computer Science, Computer Science (R0)