Maximal Words in Sequence Comparisons Based on Subword Composition

  • Chapter
Algorithms and Applications

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6060)

Abstract

Measures of sequence similarity and distance based more or less explicitly on subword composition are attracting increasing interest, driven by intensive applications such as massive document classification and genome-wide molecular taxonomy. A unifying trait of such measures is some underlying notion of relative compressibility, whereby two similar sequences are expected to share a larger number of common substrings than two distant ones. This paper reviews some of the approaches to sequence comparison based on subword composition and suggests that their common denominator may ultimately reside in special classes of subwords, the nature of which resonates in interesting ways with the structure of popular subword trees and graphs.
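
The intuition that similar sequences share more common substrings than distant ones can be illustrated with a small sketch. The code below is not the chapter's method: it assumes fixed-length subwords (k-mers) and a plain Jaccard distance on the sets of k-mers, chosen here only as one simple composition-based measure. The function names `kmer_counts`, `shared_kmers`, and `jaccard_distance` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq, overlaps included."""
    # If seq is shorter than k, the range is empty and so is the Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> int:
    """Number of distinct k-mers that occur in both sequences."""
    return len(kmer_counts(a, k).keys() & kmer_counts(b, k).keys())


def jaccard_distance(a: str, b: str, k: int) -> float:
    """1 - |shared k-mers| / |all k-mers|; an illustrative choice of measure."""
    ka, kb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = ka | kb
    if not union:  # neither sequence has any k-mer
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    a, b, c = "ACGTACGT", "ACGTTCGT", "GGGGCCCC"
    print(shared_kmers(a, b, 3), jaccard_distance(a, b, 3))
    print(shared_kmers(a, c, 3), jaccard_distance(a, c, 3))
```

Under these assumptions, the pair `a`, `b` (one substitution apart) shares some 3-mers and gets a distance below 1, while `a` and `c` share none and sit at the maximum distance of 1.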

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Apostolico, A. (2010). Maximal Words in Sequence Comparisons Based on Subword Composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_2

  • DOI: https://doi.org/10.1007/978-3-642-12476-1_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12475-4

  • Online ISBN: 978-3-642-12476-1

  • eBook Packages: Computer Science, Computer Science (R0)
