Skip to main content

Non-repetitive DNA Sequence Compression Using Memoization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 4345))

Abstract

With increasing number of DNA sequences being discovered the problem of storing and using genomic databases has become vital. Since DNA sequences consist of only four letters, two bits are sufficient to store each base. Many algorithms have been proposed in the recent past that push the bits/base limit further. The subtle patterns in DNA along with statistical inferences have been exploited to increase the compression ratio. From the compression perspective, the entire DNA sequences can be considered to be made of two types of sequences: repetitive and non-repetitive. The repetitive parts are compressed used dictionary-based schemes and non-repetitive sequences of DNA are usually compressed using general text compression schemes. In this paper, we present a memoization based encoding scheme for non-repeat DNA sequences. This scheme is incorporated with a DNA-specific compression algorithm, DNAPack, which is used for compression of DNA sequences. The results show that our method noticeably performs better than other techniques of its kind.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its application in genome comparison. genomic 12, 512–514 (2001)

    Google Scholar 

  2. Grumbach, S., Tahi, F.: Compression of dna sequences. In: Data compression conference, pp. 340–350 (1993)

    Google Scholar 

  3. Grumbach, S., Tahi, F.: A new challenge for compression algorithms genetic sequences. Journal of Information processing and Management 30, 866–875 (1994)

    Google Scholar 

  4. Matsumuto, T., Sadakane, K., Imai, H.: Biological sequences compression algorithms. In: Genome Information Ser. Workshop Genome Inform., vol. 11, pp. 43–52 (2000)

    Google Scholar 

  5. Rivals, E., Delahaye, J.-P., Dauchet, M., Delgrange, O.: A guaranteed compression scheme for repetitive dna sequences. LIFL Lille I Univerisity technical report, 285 (1995)

    Google Scholar 

  6. Willems, F.M.J., Shtralov, Y.M., Tjalkens, T.J.: The context tree weighting method:basic properties. IEE trans Inform Theory 41(3), 653–664 (1995)

    Article  MATH  Google Scholar 

  7. Sadakane, K., Okazaki, T., Imai, H.: Implementing the context tree weighting method for text compression. In: DCC 2000: Proceedings of the Conference on Data Compression, Washington, DC, USA, p. 123. IEEE Computer Society, Los Alamitos (2000)

    Chapter  Google Scholar 

  8. Rivals, E., Dauchet, M.: Fast discerning repeats in DNA sequences with a compression algorithm. In: Proc. Genome Informatics Workshop, pp. 215–226. Universal Academy Press, Tokyo (1997)

    Google Scholar 

  9. Sata, H., Yoshioka, T., Konagaya, A., Toyoda, T.: Dna compression in the post genomic era. Genome Informatics 12, 512–514 (2001)

    Google Scholar 

  10. Ziv, J., Limpel, A.: Compression of individual sequences using variable-rate encoding. IEE trans. Inform Theory 24, 530–536 (1978)

    Article  MATH  Google Scholar 

  11. Ziv, J., Limpel, A.: A universal algorithm for sequential data compression. IEE trans. Inform. Theory 23(3), 337–343 (1977)

    Article  MATH  Google Scholar 

  12. Sadel, I.: Universal data compression algorithm based on approximate string matching. In: Probability in the Engineering and Informational Sciences, pp. 465–486 (1996)

    Google Scholar 

  13. Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences. IEEE Engineering in Medicine and biology Magazine 20(4), 61–66 (2001)

    Article  Google Scholar 

  14. Li, M., Badger, J.H., Chen, J.H., Kwong, S., Kerney, P., Zhang, H.: An information based sequences distance and its application to whole mitochondrial genome. Bioinformatics 17(2), 149–154 (2001)

    Article  Google Scholar 

  15. Chen, X., La, M., Ma, B., Tromp, J.: Dnacompress: fast and effective dna sequence compression. Bioinformatics 18, 1696–1698 (2002)

    Article  Google Scholar 

  16. Ma, B., Tromp, J., Li, M.: Patternhunter-faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)

    Article  Google Scholar 

  17. Chang, C.: Dnac: A compression algorithm of dna sequences by non-overlapping approximate repeats. Master Thesis (2004)

    Google Scholar 

  18. Modegi, T.: Development of lossless compression techniques for biology information and its application for bioinformatics database retrieval. Genome Informatics (14), 695–696 (2003)

    Google Scholar 

  19. Zhang, Y., Parthe, R., Adjeroh, D.: Lossless compression of dna microarray images. csbw 0, 128–132 (2005)

    Google Scholar 

  20. Tan, Z., Cao, X., Ooi, B.C., Tung, A.K.H.: The ed-tree: An index for large dna sequence databases. ssdbm, 151 (2003)

    Google Scholar 

  21. Behzadi, B., Le Fessant, F.: Dna compression challenge revisited:a dynamic programming approach. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 190–200. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  22. Apostolico, A., Lonardi, S.: Compression of biological sequences by greedy off-line textual substitution. dcc, 143 (2000)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Srinivasa, K.G., Jagadish, M., Venugopal, K.R., Patnaik, L.M. (2006). Non-repetitive DNA Sequence Compression Using Memoization. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science(), vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_36

Download citation

  • DOI: https://doi.org/10.1007/11946465_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68063-5

  • Online ISBN: 978-3-540-68065-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics