Abstract
High-throughput DNA sequencing is now producing collections of genomes from moderately or closely related organisms. Such a collection may be represented as a multiple alignment M of orthologous sequences, which induces a phylogenetic tree τ. Long-range genomic alignments with phylogenies have not yet found a prominent place in BLAST-like similarity search algorithms, though using them directly as databases can potentially yield more accurate and more informative alignments.
This work describes how to construct local alignments between a query and a multiple alignment in a way that explicitly uses a phylogenetic tree τ. We give an EM algorithm to find a locally optimal alignment when the location of the query on the tree τ is not known. An initial implementation of the method is tested on a large multiple alignment of sequences from eight vertebrate genomes.
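As a rough illustration of what scoring a query against a multiple alignment with a tree can involve, the sketch below computes a log-odds score for one query base against one alignment column using Felsenstein's pruning recursion. It is not the method of this paper: the Jukes-Cantor substitution model, the uniform base frequencies, the treatment of gaps as missing data, the attachment of the query at the root by a branch of length `t_query`, and all function names are assumptions made for the example. The unknown-location case addressed by the EM algorithm is not covered.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, column):
    """Felsenstein pruning: P(observed leaves below node | base x at node) for each x.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        obs = column[node]
        if obs == "-":  # assumption: a gap is treated as missing data
            return {x: 1.0 for x in BASES}
        return {x: 1.0 if x == obs else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, column)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def column_score(tree, column, q, t_query):
    """Log-odds that query base q is related to the column versus independent of it.

    Assumption: the query hangs off the root by a branch of length t_query.
    """
    root = partial(tree, column)
    p_column = sum(0.25 * root[x] for x in BASES)
    p_joint = sum(0.25 * root[x] * jc_prob(x, q, t_query) for x in BASES)
    return math.log(p_joint / (p_column * 0.25))

# Hypothetical three-species tree and one conserved alignment column.
tree = [("human", 0.1), ([("mouse", 0.3), ("rat", 0.3)], 0.2)]
column = {"human": "A", "mouse": "A", "rat": "A"}
match_score = column_score(tree, column, "A", 0.1)
mismatch_score = column_score(tree, column, "C", 0.1)
```

With a conserved column, a matching query base scores above zero and a mismatching base scores below zero, which is the behavior a local alignment scoring function needs. A column consisting only of gaps carries no information under these assumptions and scores zero.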
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Buhler, J., Nordgren, R. (2005). Toward a Phylogenetically Aware Algorithm for Fast DNA Similarity Search. In: Lagergren, J. (ed.) Comparative Genomics. RCG 2004. Lecture Notes in Computer Science, vol 3388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-32290-0_2
DOI: https://doi.org/10.1007/978-3-540-32290-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24455-4
Online ISBN: 978-3-540-32290-0
eBook Packages: Computer Science, Computer Science (R0)