Non-repetitive DNA Sequence Compression Using Memoization

Srinivasa, K. G.; Jagadish, M.; Venugopal, K. R.; Patnaik, L. M.

doi:10.1007/11946465_36

Non-repetitive DNA Sequence Compression Using Memoization

K. G. Srinivasa²²,
M. Jagadish²³,
K. R. Venugopal²⁴ &
…
L. M. Patnaik²⁵

Conference paper

957 Accesses
1 Citations

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 4345))

Abstract

With increasing number of DNA sequences being discovered the problem of storing and using genomic databases has become vital. Since DNA sequences consist of only four letters, two bits are sufficient to store each base. Many algorithms have been proposed in the recent past that push the bits/base limit further. The subtle patterns in DNA along with statistical inferences have been exploited to increase the compression ratio. From the compression perspective, the entire DNA sequences can be considered to be made of two types of sequences: repetitive and non-repetitive. The repetitive parts are compressed used dictionary-based schemes and non-repetitive sequences of DNA are usually compressed using general text compression schemes. In this paper, we present a memoization based encoding scheme for non-repeat DNA sequences. This scheme is incorporated with a DNA-specific compression algorithm, DNAPack, which is used for compression of DNA sequences. The results show that our method noticeably performs better than other techniques of its kind.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its application in genome comparison. genomic 12, 512–514 (2001)
Google Scholar
Grumbach, S., Tahi, F.: Compression of dna sequences. In: Data compression conference, pp. 340–350 (1993)
Google Scholar
Grumbach, S., Tahi, F.: A new challenge for compression algorithms genetic sequences. Journal of Information processing and Management 30, 866–875 (1994)
Google Scholar
Matsumuto, T., Sadakane, K., Imai, H.: Biological sequences compression algorithms. In: Genome Information Ser. Workshop Genome Inform., vol. 11, pp. 43–52 (2000)
Google Scholar
Rivals, E., Delahaye, J.-P., Dauchet, M., Delgrange, O.: A guaranteed compression scheme for repetitive dna sequences. LIFL Lille I Univerisity technical report, 285 (1995)
Google Scholar
Willems, F.M.J., Shtralov, Y.M., Tjalkens, T.J.: The context tree weighting method:basic properties. IEE trans Inform Theory 41(3), 653–664 (1995)
Article MATH Google Scholar
Sadakane, K., Okazaki, T., Imai, H.: Implementing the context tree weighting method for text compression. In: DCC 2000: Proceedings of the Conference on Data Compression, Washington, DC, USA, p. 123. IEEE Computer Society, Los Alamitos (2000)
Chapter Google Scholar
Rivals, E., Dauchet, M.: Fast discerning repeats in DNA sequences with a compression algorithm. In: Proc. Genome Informatics Workshop, pp. 215–226. Universal Academy Press, Tokyo (1997)
Google Scholar
Sata, H., Yoshioka, T., Konagaya, A., Toyoda, T.: Dna compression in the post genomic era. Genome Informatics 12, 512–514 (2001)
Google Scholar
Ziv, J., Limpel, A.: Compression of individual sequences using variable-rate encoding. IEE trans. Inform Theory 24, 530–536 (1978)
Article MATH Google Scholar
Ziv, J., Limpel, A.: A universal algorithm for sequential data compression. IEE trans. Inform. Theory 23(3), 337–343 (1977)
Article MATH Google Scholar
Sadel, I.: Universal data compression algorithm based on approximate string matching. In: Probability in the Engineering and Informational Sciences, pp. 465–486 (1996)
Google Scholar
Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences. IEEE Engineering in Medicine and biology Magazine 20(4), 61–66 (2001)
Article Google Scholar
Li, M., Badger, J.H., Chen, J.H., Kwong, S., Kerney, P., Zhang, H.: An information based sequences distance and its application to whole mitochondrial genome. Bioinformatics 17(2), 149–154 (2001)
Article Google Scholar
Chen, X., La, M., Ma, B., Tromp, J.: Dnacompress: fast and effective dna sequence compression. Bioinformatics 18, 1696–1698 (2002)
Article Google Scholar
Ma, B., Tromp, J., Li, M.: Patternhunter-faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
Article Google Scholar
Chang, C.: Dnac: A compression algorithm of dna sequences by non-overlapping approximate repeats. Master Thesis (2004)
Google Scholar
Modegi, T.: Development of lossless compression techniques for biology information and its application for bioinformatics database retrieval. Genome Informatics (14), 695–696 (2003)
Google Scholar
Zhang, Y., Parthe, R., Adjeroh, D.: Lossless compression of dna microarray images. csbw 0, 128–132 (2005)
Google Scholar
Tan, Z., Cao, X., Ooi, B.C., Tung, A.K.H.: The ed-tree: An index for large dna sequence databases. ssdbm, 151 (2003)
Google Scholar
Behzadi, B., Le Fessant, F.: Dna compression challenge revisited:a dynamic programming approach. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 190–200. Springer, Heidelberg (2005)
Chapter Google Scholar
Apostolico, A., Lonardi, S.: Compression of biological sequences by greedy off-line textual substitution. dcc, 143 (2000)
Google Scholar

Download references

Author information

Authors and Affiliations

Data Mining Laboratory, M S Ramaiah Institute of Technology, Bangalore
K. G. Srinivasa
Software Engineer, MindTree Consulting, Bangalore
M. Jagadish
Professor, University of Visvesvaraya College of Engineering, Bangalore University, Bangalore, 560 001
K. R. Venugopal
Professor, Microprocessor Application Laboratory, Indian Institute of Science, Bangalore, 560 012
L. M. Patnaik

Authors

K. G. Srinivasa
View author publications
You can also search for this author in PubMed Google Scholar
M. Jagadish
View author publications
You can also search for this author in PubMed Google Scholar
K. R. Venugopal
View author publications
You can also search for this author in PubMed Google Scholar
L. M. Patnaik
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

The Medical School, Lab of Medical Informatics, Aristotle University, Box 323, 54124, Thessaloniki, Greece
Nicos Maglaveras & Ioanna Chouvarda &
Laboratory of Medical Informatics, Faculty of Medicine, Aristotle University of Thessaloniki, P.O.Box 323, 54124, Thessaloniki, Greece
Vassilis Koutkias
Institute of Informatics, Johann Wolfgang Goethe-Universität, 60054, Frankfurt a.M., Germany
Rüdiger Brause

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Srinivasa, K.G., Jagadish, M., Venugopal, K.R., Patnaik, L.M. (2006). Non-repetitive DNA Sequence Compression Using Memoization. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science(), vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_36

Download citation

DOI: https://doi.org/10.1007/11946465_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68063-5
Online ISBN: 978-3-540-68065-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics