Abstract
Every data compression method assumes a certain model of the information source that produces the data, so improving a compression method also improves the model of the source. The closer the probability distribution of the assumed model is to the true probability distribution of the source, the smaller the relative entropy between the two and, therefore, the fewer redundant bits are required. This is why the importance of data compression goes beyond the usual goal of reducing the storage space or the transmission time of the information. In some situations, seeking better models is the main aim, and in our view this is the case for DNA sequence data. In this paper, we outline how finite-context (Markov) modeling may be used for DNA sequence analysis through the construction of complexity profiles of the sequences. These profiles can reveal structures of the DNA, some of them with potential biological relevance.
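To make the idea of a complexity profile concrete, the following is a minimal sketch and not the authors' implementation. It assumes an adaptive order-k finite-context model over the alphabet {A, C, G, T}, with probabilities estimated from running counts using an additive smoothing parameter. The function name `complexity_profile` and the parameters `order` and `alpha` are hypothetical, chosen only for illustration. Each base is charged -log2 of the probability the model assigned to it before seeing it, which is the number of bits an ideal arithmetic coder driven by that model would spend.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four symbols


def complexity_profile(seq, order=2, alpha=1.0):
    """Return the per-base information content, in bits, of `seq`.

    Illustrative sketch: an adaptive finite-context model whose context is
    the `order` preceding symbols. `alpha` is an additive smoothing
    parameter (alpha=1 is Laplace's estimator, alpha=0.5 is the
    Krichevsky-Trofimov estimator).
    """
    # counts[context][symbol] = times `symbol` has followed `context` so far
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        # The first `order` positions use a shorter context (fewer symbols).
        context = seq[max(0, i - order):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Probability assigned to the symbol before the counts are updated.
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))
        ctx_counts[symbol] += 1  # update the model only after "coding"
    return profile


if __name__ == "__main__":
    bits = complexity_profile("ACGT" * 20, order=2)
    print(f"first base: {bits[0]:.3f} bits, last base: {bits[-1]:.3f} bits")
    print(f"average: {sum(bits) / len(bits):.3f} bits per base")
```

Under these assumptions, the average of the profile estimates the bits per base needed to encode the sequence with this model, and stretches where the profile drops well below 2 bits mark regions the model predicts well, such as repeats. In practice the raw profile would usually be smoothed with a sliding window before inspection.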
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pinho, A.J., Pratas, D., Garcia, S.P. (2011). Complexity Profiles of DNA Sequences Using Finite-Context Models. In: Holzinger, A., Simonic, KM. (eds) Information Quality in e-Health. USAB 2011. Lecture Notes in Computer Science, vol 7058. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25364-5_8
DOI: https://doi.org/10.1007/978-3-642-25364-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25363-8
Online ISBN: 978-3-642-25364-5
eBook Packages: Computer Science (R0)