
Dynamic Alignment-Free and Reference-Free Read Compression

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

The advent of High Throughput Sequencing (HTS) technologies raises a major concern about the storage and transmission of the data they produce. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences, ranging from tens to several thousands of genomes per species. These collections of highly similar and redundant sequences are also known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared to the best-performing state-of-the-art HTS-specific compression method in our experiments.

Availability. DARRC is available at https://github.com/GuillaumeHolley/DARRC.
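
The general idea of storing reads as paths in a de Bruijn graph can be illustrated with a minimal sketch. It is not DARRC's actual encoding, which the abstract does not detail: the function names, the plain Python set used as the graph, and the restriction to the four-letter alphabet A, C, G, T are all assumptions made for illustration. Each read is stored as its first k-mer, its length, and only those bases where the graph branches.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_set(reads, k):
    """Node set of a de Bruijn graph: every k-mer seen in any read."""
    return {km for read in reads for km in kmers(read, k)}


def successors(kmer, kmer_set, alphabet="ACGT"):
    """k-mers in the graph that overlap kmer by k-1 characters."""
    return [kmer[1:] + c for c in alphabet if kmer[1:] + c in kmer_set]


def encode_read(read, kmer_set, k):
    """Hypothetical path encoding: (first k-mer, read length, branch bases).

    A base is stored only when the current k-mer has more than one
    successor; otherwise the graph alone determines the next base.
    Assumes len(read) >= k and that all k-mers of read are in kmer_set.
    """
    current = read[:k]
    branches = []
    for base in read[k:]:
        if len(successors(current, kmer_set)) > 1:
            branches.append(base)
        current = current[1:] + base
    return read[:k], len(read), "".join(branches)


def decode_read(start, length, branches, kmer_set):
    """Rebuild a read by walking the graph from its first k-mer."""
    seq = start
    current = start
    stored = iter(branches)
    while len(seq) < length:
        succ = successors(current, kmer_set)
        # Unique successor: follow it. Branch: consume one stored base.
        nxt = succ[0] if len(succ) == 1 else current[1:] + next(stored)
        seq += nxt[-1]
        current = nxt
    return seq


# Two toy reads that differ by a single base.
reads = ["ACGTACGGA", "ACGTTCGGA"]
k = 4
graph = build_kmer_set(reads, k)
for read in reads:
    start, length, branches = encode_read(read, graph, k)
    assert decode_read(start, length, branches, graph) == read
```

The more the reads overlap, the fewer branch bases need to be stored, which is why highly redundant collections such as pan-genomes suit graph-based encodings. A real tool must also store the graph itself compactly and handle the full IUPAC alphabet, both of which this sketch ignores.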

Acknowledgments

This research is funded by the International DFG Research Training Group GRK 1906/1 for GH and RW, and by the NSERC Discovery Frontiers grant on “Cancer Genome Collaboratory” to FH.

Author information

Corresponding authors

Correspondence to Guillaume Holley or Faraz Hach.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Holley, G., Wittler, R., Stoye, J., Hach, F. (2017). Dynamic Alignment-Free and Reference-Free Read Compression. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_4

  • DOI: https://doi.org/10.1007/978-3-319-56970-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56969-7

  • Online ISBN: 978-3-319-56970-3

  • eBook Packages: Computer Science, Computer Science (R0)
