Complexity Profiles of DNA Sequences Using Finite-Context Models

Pinho, Armando J.; Pratas, Diogo; Garcia, Sara P.

doi:10.1007/978-3-642-25364-5_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7058)

Included in the following conference series:

Symposium of the Austrian HCI and Usability Engineering Group

Abstract

Every data compression method assumes a certain model of the information source that produces the data. When we improve a data compression method, we are also improving the model of the source. This happens because, when the probability distribution of the assumed source model is closer to the true probability distribution of the source, a smaller relative entropy results and, therefore, fewer redundancy bits are required. This is why the importance of data compression goes beyond the usual goal of reducing the storage space or the transmission time of the information. In fact, in some situations, seeking better models is the main aim. In our view, this is the case for DNA sequence data. In this paper, we give hints on how finite-context (Markov) modeling may be used for DNA sequence analysis, through the construction of complexity profiles of the sequences. These profiles are able to unveil structures of the DNA, some of them with potential biological relevance.
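The idea of a complexity profile can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single order-k finite-context model with additive (Laplace-style) smoothing, and the names `complexity_profile`, `k`, and `alpha` are hypothetical, chosen only for this illustration. Each base is scored by the number of bits, -log2 of its estimated probability, that the model would need to encode it given the k preceding bases.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the input contains only these four symbols


def complexity_profile(seq, k=2, alpha=1.0):
    """Per-base information content (bits) under an order-k finite-context model.

    The model is adaptive and causal: each symbol is scored using only the
    counts gathered from the symbols that precede it, then the counts are
    updated. `alpha` is the additive smoothing parameter (alpha=1 corresponds
    to Laplace's estimator); this choice is an assumption of the sketch.
    """
    index = {s: i for i, s in enumerate(ALPHABET)}
    # Map each context string to the counts of the symbols seen after it.
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    profile = []
    for i, sym in enumerate(seq):
        # The first k symbols have shorter contexts, stored under their own keys.
        context = seq[max(0, i - k):i]
        c = counts[context]
        p = (c[index[sym]] + alpha) / (sum(c) + alpha * len(ALPHABET))
        profile.append(-math.log2(p))
        c[index[sym]] += 1  # update the model after scoring the symbol
    return profile


if __name__ == "__main__":
    # A repetitive stretch followed by a less regular one.
    bits = complexity_profile("ACACACACACACGTTAGC", k=2)
    print(" ".join(f"{b:.2f}" for b in bits))
```

In such a profile, regions the model predicts well (for example repeats) show low values, while regions it predicts poorly show values near or above 2 bits per base. In practice the raw profile would usually be smoothed before being inspected.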



Author information

Authors and Affiliations

Signal Processing Lab, IEETA / DETI, University of Aveiro, 3810–193, Aveiro, Portugal
Armando J. Pinho, Diogo Pratas & Sara P. Garcia

Editor information

Editors and Affiliations

Institute of Medical Informatics, Statistics and Documentation (IMI), Research Unit Human–Computer Interaction, Medical University Graz (MUG), Auenbruggerplatz 2/V, 8036, Graz, Austria
Andreas Holzinger & Klaus-Martin Simonic

About this paper

Cite this paper

Pinho, A.J., Pratas, D., Garcia, S.P. (2011). Complexity Profiles of DNA Sequences Using Finite-Context Models. In: Holzinger, A., Simonic, KM. (eds) Information Quality in e-Health. USAB 2011. Lecture Notes in Computer Science, vol 7058. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25364-5_8

DOI: https://doi.org/10.1007/978-3-642-25364-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25363-8
Online ISBN: 978-3-642-25364-5
eBook Packages: Computer Science, Computer Science (R0)
