Maximal Words in Sequence Comparisons Based on Subword Composition

Apostolico, Alberto

doi:10.1007/978-3-642-12476-1_2

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6060)

Abstract

Measures of sequence similarity and distance based more or less explicitly on subword composition are attracting increasing interest, driven by intensive applications such as massive document classification and genome-wide molecular taxonomy. A common trait of such measures is an underlying notion of relative compressibility, whereby two similar sequences are expected to share more common substrings than two distant ones. This paper reviews some of the approaches to sequence comparison based on subword composition and suggests that their common denominator may ultimately reside in special classes of subwords, whose nature resonates in interesting ways with the structure of popular subword trees and graphs.

Author information

Authors and Affiliations

Georgia Institute of Technology & Università di Padova
Alberto Apostolico

Editor information

Editors and Affiliations

Department of Software Systems, Tampere University of Technology, P. O. Box 553, 33101, Tampere, Finland
Tapio Elomaa
Department of Information and Computer Science, Aalto University School of Science and Technology, P.O. Box 17800, 00076, Aalto, Finland
Heikki Mannila
Department of Information and Computer Science, Aalto University School of Science and Technology, P.O. Box 15400, 00076, Aalto, Finland
Pekka Orponen

About this chapter

Cite this chapter

Apostolico, A. (2010). Maximal Words in Sequence Comparisons Based on Subword Composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_2

DOI: https://doi.org/10.1007/978-3-642-12476-1_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12475-4
Online ISBN: 978-3-642-12476-1
eBook Packages: Computer Science; Computer Science (R0)
