NRRC: A Non-referential Reads Compression Algorithm

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9096)

Abstract

Modern sequencing technology is producing a vast amount of biological sequence data. Storing, processing, and analyzing these data is not as cheap as generating them, so the need for efficient data compression and data reduction techniques is growing by the day. Although a number of sophisticated general-purpose compression algorithms exist, they do not compress biological data efficiently; specialized compression algorithms targeting biological data are therefore needed. Five different NGS data compression problems have been identified and studied. In this article we propose a novel algorithm for one of these problems, the non-referential compression of reads. We have conducted extensive experiments using real sequencing reads of various lengths. The results show that the proposed algorithm is competitive and performs better than the best known algorithms in the current literature.

Author information

Corresponding author

Correspondence to Subrata Saha.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Saha, S., Rajasekaran, S. (2015). NRRC: A Non-referential Reads Compression Algorithm. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_25

  • DOI: https://doi.org/10.1007/978-3-319-19048-8_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19047-1

  • Online ISBN: 978-3-319-19048-8

  • eBook Packages: Computer Science, Computer Science (R0)
