Procrastination Leads to Efficient Filtration for Local Multiple Alignment

Darling, Aaron E.; Treangen, Todd J.; Zhang, Louxin; Kuiken, Carla; Messeguer, Xavier; Perna, Nicole T.

doi:10.1007/11851561_12

Aaron E. Darling²¹,
Todd J. Treangen²²,
Louxin Zhang²⁴,
Carla Kuiken²⁵,
Xavier Messeguer²² &
…
Nicole T. Perna²³

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 4175))

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

947 Accesses
7 Citations

Abstract

We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes \(\mathcal{O}(wN)\) memory and \(\mathcal{O}(wN \log wN)\) time where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from http://gel.ahabs.wisc.edu/procrastination

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm

Article Open access 10 July 2015

Crumple: An Efficient Tool to Explore Thoroughly the RNA Folding Landscape

Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD

Article Open access 17 July 2023

References

Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
Article Google Scholar
Brudno, M., Morgenstern, B.: Fast and sensitive alignment of large genomic sequences. In: Proc IEEE CSB 2002, pp. 138–147 (2002)
Google Scholar
Noé, L., Kucherov, G.: Improved hit criteria for DNA local alignment. BMC Bioinformatics 5 (2004)
Google Scholar
Kahveci, T., Ljosa, V., Singh, A.K.: Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics 20, 2122–2134 (2004)
Article Google Scholar
Choi, P., Zeng, K., Zhang, F.L.: Good spaced seeds for homology search. Bioinformatics 20, 1053–1059 (2004)
Article Google Scholar
Li, M., Ma, B., Zhang, L.: Superiority and complexity of the spaced seeds. In: Proc. SODA 2006, pp. 444–453 (2006)
Google Scholar
Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. J. Comput. Biol. 12, 847–861 (2005)
Article Google Scholar
Xu, J., Brown, D.G., Li, M., Ma, B.: Optimizing multiple spaced seeds for homology search. In: CPM 2004, pp. 47–58 (2004)
Google Scholar
Flannick, J., Batzoglou, S.: Using multiple alignments to improve seeded local alignment algorithms. Nucleic Acids Res. 33, 4563–4577 (2005)
Article Google Scholar
Li, L., Stoeckert, C.J., Roos, D.S.: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003)
Article Google Scholar
Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C., Lander, E.S.: Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003)
Article Google Scholar
Ane, C., Sanderson, M.: Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst. Biol. 54, I311–I317 (2005)
Article Google Scholar
Margulies, M., et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005)
Google Scholar
Darling, A.C.E., Mau, B., Blattner, F.R., Perna, N.T.: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14(7), 1394–1403 (2004)
Article Google Scholar
Hohl, M., Kurtz, S., Ohlebusch, E.: Efficient multiple genome alignment. Bioinformatics 18(suppl. 1), S312–S320 (2002)
Google Scholar
Treangen, T., Messeguer, X.: M-GCAT: Multiple Genome Comparison and Alignment Tool (submitted, 2006)
Google Scholar
Dewey, C.N., Pachter, L.: Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Hum. Mol. Genet. 15(suppl. 1) (2006)
Google Scholar
Sammeth, M., Heringa, J.: Global multiple-sequence alignment with repeats. Proteins (2006)
Google Scholar
Raphael, B., Zhi, D., Tang, H., Pevzner, P.: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 14(11), 2336–2346 (2004)
Article Google Scholar
Edgar, R.C., Myers, E.W.: PILER: identification and classification of genomic repeats. Bioinformatics 21(suppl. 1) (2005)
Google Scholar
Kurtz, S., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: Computation and visualization of degenerate repeats in complete genomes. In: Proc. 8th Intell. Syst. Mol. Biol. ISMB 2000, pp. 228–238 (2000)
Google Scholar
Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz, J.: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 110, 462–467 (2005)
Article Google Scholar
Zhang, Y., Waterman, M.S.: An Eulerian path approach to local multiple alignment for DNA sequences. PNAS 102, 1285–1290 (2005)
Article MathSciNet Google Scholar
Siddharthan, R., Siggia, E.D., van Nimwegen, E.: PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol. 1 (2005)
Google Scholar
Nagarajan, N., Jones, N., Keich, U.: Computing the P-value of the information content from an alignment of multiple sequences. Bioinformatics 21(suppl. 1) (2005)
Google Scholar
Szklarczyk, R., Heringa, J.: Tracking repeats using significance and transitivity. Bioinformatics 20(suppl. 1), 311–317 (2004)
Article Google Scholar
Kuiken, C., Yusim, K., Boykin, L., Richardson, R.: The Los Alamos hepatitis C sequence database. Bioinformatics 21, 379–384 (2005)
Article Google Scholar
Prakash, A., Tompa, M.: Statistics of local multiple alignments. Bioinformatics 21, i344–i350 (2005)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Wisconsin, USA
Aaron E. Darling
Department of Computer Science, Technical University of Catalonia, Barcelona, Spain
Todd J. Treangen & Xavier Messeguer
Department of Animal Health and Biomedical Sciences, Genome Center, University of Wisconsin, USA
Nicole T. Perna
Department of Mathematics, National University of Singapore, Singapore
Louxin Zhang
T-10 Theoretical Biology Division, Los Alamos National Laboratory, USA
Carla Kuiken

Authors

Aaron E. Darling
View author publications
You can also search for this author in PubMed Google Scholar
Todd J. Treangen
View author publications
You can also search for this author in PubMed Google Scholar
Louxin Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Carla Kuiken
View author publications
You can also search for this author in PubMed Google Scholar
Xavier Messeguer
View author publications
You can also search for this author in PubMed Google Scholar
Nicole T. Perna
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Ecole Polytechnique Fédérale de Lausanne, Switzerland
Philipp Bücher
Laboratory for Computational Biology and Bioinformatics, EPFL (Ecole Polytechnique Fédérale de Lausanne), Swiss Institute of Bioinformatics, Lausanne, Switzerland
Bernard M. E. Moret

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Darling, A.E., Treangen, T.J., Zhang, L., Kuiken, C., Messeguer, X., Perna, N.T. (2006). Procrastination Leads to Efficient Filtration for Local Multiple Alignment. In: Bücher, P., Moret, B.M.E. (eds) Algorithms in Bioinformatics. WABI 2006. Lecture Notes in Computer Science(), vol 4175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11851561_12

Download citation

DOI: https://doi.org/10.1007/11851561_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39583-6
Online ISBN: 978-3-540-39584-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics