Toward a Phylogenetically Aware Algorithm for Fast DNA Similarity Search

  • Conference paper
Comparative Genomics (RCG 2004)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3388)

Abstract

High-throughput DNA sequencing is now producing collections of genomes from moderately or closely related organisms. Such a collection may be represented as a multiple alignment M of orthologous sequences, which induces a phylogenetic tree τ. Long-range genomic alignments with phylogenies have not yet found a prominent place in BLAST-like similarity search algorithms, though using them directly as databases can potentially yield more accurate and more informative alignments.

This work describes how to construct local alignments between a query and a multiple alignment in a way that explicitly uses a phylogenetic tree τ. We give an EM algorithm to find a locally optimal alignment when the location of the query on the tree τ is not known. An initial implementation of the method is tested on a large multiple alignment of sequences from eight vertebrate genomes.
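To make the idea of scoring a query against an alignment column concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Jukes-Cantor substitution model, a uniform background distribution, gaps treated as missing data, and a query attached at the root of τ by a branch of known length `t_query`. The tree encoding and the names `jc69`, `partial`, and `query_score` are hypothetical. The EM step described above, which handles an unknown query location, is not shown.

```python
import math

BASES = "ACGT"

def jc69(t):
    """4x4 Jukes-Cantor transition matrix for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial(node, column):
    """Felsenstein pruning: P(observed leaves below node | node state x), for each x.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is an illustrative assumption.
    """
    if isinstance(node, str):
        base = column[node]
        if base not in BASES:          # gap or unknown base: treat as missing data
            return [1.0] * 4
        return [1.0 if x == base else 0.0 for x in BASES]
    vec = [1.0] * 4
    for child, t in node:
        p = jc69(t)
        sub = partial(child, column)
        for x in range(4):
            vec[x] *= sum(p[x][y] * sub[y] for y in range(4))
    return vec

def query_score(tree, column, q, t_query):
    """Log-odds of query base q given the column versus q drawn from background."""
    root = partial(tree, column)
    p = jc69(t_query)
    qi = BASES.index(q)
    joint = sum(0.25 * root[x] * p[x][qi] for x in range(4))
    marginal = sum(0.25 * root[x] for x in range(4))
    return math.log(joint / (marginal * 0.25))

# Hypothetical three-species tree and one fully conserved alignment column.
tree = [("human", 0.1), ([("mouse", 0.3), ("rat", 0.3)], 0.2)]
column = {"human": "A", "mouse": "A", "rat": "A"}
match_score = query_score(tree, column, "A", 0.1)
mismatch_score = query_score(tree, column, "C", 0.1)
```

Under these assumptions, a query base that agrees with a conserved column scores above zero and a disagreeing base scores below zero. Summing such per-column scores along a candidate alignment gives a tree-aware analogue of a substitution-matrix score.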

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Buhler, J., Nordgren, R. (2005). Toward a Phylogenetically Aware Algorithm for Fast DNA Similarity Search. In: Lagergren, J. (eds) Comparative Genomics. RCG 2004. Lecture Notes in Computer Science, vol 3388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-32290-0_2

  • DOI: https://doi.org/10.1007/978-3-540-32290-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24455-4

  • Online ISBN: 978-3-540-32290-0

  • eBook Packages: Computer Science, Computer Science (R0)
