Toward a Phylogenetically Aware Algorithm for Fast DNA Similarity Search

Buhler, Jeremy; Nordgren, Rachel

doi:10.1007/978-3-540-32290-0_2

Jeremy Buhler²⁰ &
Rachel Nordgren²⁰

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 3388))

Included in the following conference series:

RECOMB Workshop on Comparative Genomics

357 Accesses
2 Citations

Abstract

High-throughput DNA sequencing is now producing collections of genomes from moderately or closely related organisms. Such a collection may be represented as a multiple alignment M of orthologous sequences, which induces a phylogenetic tree τ. Long-range genomic alignments with phylogenies have not yet found a prominent place in BLAST-like similarity search algorithms, though using them directly as databases can potentially yield more accurate and more informative alignments.

This work describes how to construct local alignments between a query and a multiple alignment in a way that explicitly uses a phylogenetic tree τ. We give an EM algorithm to find a locally optimal alignment when the location of the query on the tree τ is not known. An initial implementation of the method is tested on a large multiple alignment of sequences from eight vertebrate genomes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B.A., Johnston, M.: Finding functional features in saccharomyces genomes by phylogenetic footprinting. Science 301, 71–76 (2003)
Article Google Scholar
Bahl, A., Brunk, B., Crabtree, J., Fraunhoz, M.J., et al.: PlasmoDB: the Plasmodium genome resource. Nucleic Acids Research 31, 212–215 (2003)
Article Google Scholar
Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., et al.: Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793 (2003)
Article Google Scholar
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A.F., et al.: PipMaker – a web server for aligning two genomic DNA sequences. Genome Research 10, 577–586 (2000)
Article Google Scholar
Höhl, M., Kurtz, S., Ohlebusch, E.: Efficient multiple genome alignment. Bioinformatics 18, S312–S320 (2002)
Google Scholar
Bray, N., Dubchak, I., Pachter, L.: AVID: a global alignment program. Genome Research 13, 97–102 (2003)
Article Google Scholar
Brudno, M., Do, C., Cooker, G., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., Batzoglou, S.: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 13, 721–731 (2003)
Article Google Scholar
Siepel, A., Haussler, D.: Computational identification of evolutionarily conserved exons. In: Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB 2004), San Diego, CA, pp. 177–186 (2004)
Google Scholar
McAuliffe, J.D., Pachter, L., Jordan, M.I.: Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics 20, 1850–1860 (2004)
Article Google Scholar
Altschul, S.F., Gish, W.: Local alignment statistics. Methods: a Companion to Methods in Enzymology 266, 460–480 (1996)
Article Google Scholar
Altschul, S.F., Madden, T.L., Scháffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
Article Google Scholar
Yona, G., Levitt, M.: A unified sequence-structure classificatin of proteins: combining sequence and structure in a map of protein space. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB 2000), Tokyo, Japan, pp. 308–317 (2000)
Google Scholar
Wang, T., Stormo, G.D.: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19, 2369–2380 (2003)
Article Google Scholar
Tamura, K., Nei, M.: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10, 512–526 (1993)
Google Scholar
McGuire, G., Denham, M.C., Balding, D.J.: Models of sequence evolution for DNA sequences containing gaps. Molecular Biology and Evolution 18, 481–490 (2001)
Google Scholar
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, New York (1998)
Book Google Scholar
Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003), Berlin, Germany, pp. 67–75 (2003)
Google Scholar
States, D.J., Gish, W., Altschul, S.F.: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: a Companion to Methods in Enzymology 3, 66–70 (1991)
Article Google Scholar
Karlin, S., Altschul, S.F.: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87, 2264–2268 (1990)
Article MATH Google Scholar
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–38 (1977)
MATH MathSciNet Google Scholar
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
Article Google Scholar
Meza, J.C., Hough, P.D., Williams, P.J.: Opt++ optimization library 2.1r3 (2004), http://csmr.ca.sandia.gov/projects/opt++
Strimmer, K., von Haeseler, A.: Nucleotide substitution models. In: Salemi, M., Vandamme, A.M. (eds.) The Phylogenetic Handbook. Cambridge University Press, New York (2003)
Google Scholar
Siepel, A., Haussler, D.: Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Molecular Biology and Evolution 21, 468–488 (2004)
Article Google Scholar
Smit, A.F., Green, P.: Repeatmasker (1999), http://ftp.genome.washington.edu/RM/RepeatMasker.html

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Washington University, St. Louis, MO, 63130, USA
Jeremy Buhler & Rachel Nordgren

Authors

Jeremy Buhler
View author publications
You can also search for this author in PubMed Google Scholar
Rachel Nordgren
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

KTH, Royal Institute of Technology, Stockholm, Sweden
Jens Lagergren

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Buhler, J., Nordgren, R. (2005). Toward a Phylogenetically Aware Algorithm for Fast DNA Similarity Search. In: Lagergren, J. (eds) Comparative Genomics. RCG 2004. Lecture Notes in Computer Science(), vol 3388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-32290-0_2

Download citation

DOI: https://doi.org/10.1007/978-3-540-32290-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24455-4
Online ISBN: 978-3-540-32290-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics