Abstract
Measures of sequence similarity and distance based, more or less explicitly, on subword composition are attracting increasing interest, driven by intensive applications such as massive document classification and genome-wide molecular taxonomy. A unifying feature of such measures is some underlying notion of relative compressibility, whereby two similar sequences are expected to share a larger number of common substrings than two distant ones. This paper reviews some of the approaches to sequence comparison based on subword composition and suggests that their common denominator may ultimately reside in special classes of subwords, the nature of which resonates in interesting ways with the structure of popular subword trees and graphs.
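As a rough illustration of the idea that similar sequences share more substrings than distant ones, the following minimal sketch compares sequences by the sets of length-k subwords they contain. It is not the method developed in the chapter: the function names `subword_set` and `composition_distance`, the choice of a Jaccard-style distance, and the value k = 3 are all assumptions made only for this example.

```python
def subword_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (subwords) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def composition_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard-style distance between the k-subword sets of a and b.

    0.0 means the two sequences contain exactly the same k-subwords;
    1.0 means they share none. This is an illustrative choice, not a
    measure taken from the chapter.
    """
    sa, sb = subword_set(a, k), subword_set(b, k)
    union = sa | sb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


if __name__ == "__main__":
    x = "ACGTACGTAC"
    y = "ACGTACGTTC"   # differs from x in a single position
    z = "TTTTGGGGCC"   # shares no 3-subword with x
    print(composition_distance(x, y))
    print(composition_distance(x, z))
```

Under these assumptions, the pair that differs in one position comes out closer than the unrelated pair, which is the qualitative behaviour the abstract describes. Practical measures typically use subword counts or frequencies rather than plain sets.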
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Apostolico, A. (2010). Maximal Words in Sequence Comparisons Based on Subword Composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_2
DOI: https://doi.org/10.1007/978-3-642-12476-1_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12475-4
Online ISBN: 978-3-642-12476-1
eBook Packages: Computer Science, Computer Science (R0)