Estimating Evolutionary Distances from Spaced-Word Matches

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8701)

Abstract

Alignment-free methods are increasingly used to estimate distances between DNA and protein sequences and to reconstruct phylogenetic trees. Most distance functions used by these methods, however, are heuristic measures of dissimilarity, not based on any explicit model of evolution. Herein, we propose a simple estimator of the evolutionary distance between two DNA sequences calculated from the number of (spaced) word matches between them. We show that this distance function estimates the evolutionary distance between DNA sequences more accurately than other distance measures used by alignment-free methods. In addition, we calculate the variance of the number of (spaced) word matches depending on sequence length and mismatch probability.
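The idea of turning a count of spaced-word matches into a distance can be illustrated with a minimal Python sketch. It is not the estimator from the paper, whose details are not given in the abstract. It assumes an indel-free model in which homologous positions match with probability p, unrelated positions match with a uniform background probability q = 1/4, and the expected number of matches N for a pattern with k match positions is roughly m * p^k + (la * lb - m) * q^k. The function names, the pattern encoding ('1' for a match position, '0' for a don't-care position) and the final Jukes-Cantor correction are illustrative choices.

```python
import math
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced words of seq for a binary pattern.

    '1' marks a match position, '0' a don't-care position (hypothetical encoding).
    """
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_pos)


def count_spaced_word_matches(seq_a, seq_b, pattern):
    """Count position pairs (i, j) whose spaced words in seq_a and seq_b agree."""
    counts_a = Counter(spaced_words(seq_a, pattern))
    counts_b = Counter(spaced_words(seq_b, pattern))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def estimate_distance(seq_a, seq_b, pattern, q=0.25):
    """Estimate substitutions per site from the spaced-word match count.

    Assumed model (not taken from the paper):
        E[N] = m * p**k + (la * lb - m) * q**k
    where m is the number of homologous position pairs, p the match
    probability at homologous positions and q the background match probability.
    """
    k = pattern.count("1")
    la = len(seq_a) - len(pattern) + 1
    lb = len(seq_b) - len(pattern) + 1
    m = min(la, lb)
    if m <= 0 or k == 0:
        raise ValueError("pattern must contain a match position and fit both sequences")

    n = count_spaced_word_matches(seq_a, seq_b, pattern)
    background = (la * lb - m) * q ** k
    p_k = (n - background) / m
    if p_k <= 0:
        return math.inf  # no signal above the background expectation

    p = min(1.0, p_k ** (1.0 / k))
    mismatch = 1.0 - p
    if mismatch >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined at saturation
    # Jukes-Cantor correction from mismatch probability to substitutions per site
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * mismatch)
```

For example, `count_spaced_word_matches("ACGT", "ACGT", "101")` compares the spaced words AG and CT of each sequence and counts two matches. On short inputs the background term dominates, so the estimate is only meaningful for long sequences and patterns with enough match positions.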

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Morgenstern, B., Zhu, B., Horwege, S., Leimeister, CA. (2014). Estimating Evolutionary Distances from Spaced-Word Matches. In: Brown, D., Morgenstern, B. (eds) Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science, vol 8701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44753-6_13

  • DOI: https://doi.org/10.1007/978-3-662-44753-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44752-9

  • Online ISBN: 978-3-662-44753-6

  • eBook Packages: Computer Science (R0)
