Dynamic Alignment-Free and Reference-Free Read Compression

Holley, Guillaume; Wittler, Roland; Stoye, Jens; Hach, Faraz

doi:10.1007/978-3-319-56970-3_4

Conference paper
First Online: 12 April 2017

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.

Availability. DARRC is available at https://github.com/GuillaumeHolley/DARRC.
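
The abstract describes encoding reads as a de Bruijn graph that can be extended with new sequences without rebuilding it. The following is a minimal illustrative sketch of that general idea only, not DARRC's implementation: it stores a plain de Bruijn graph in a Python dictionary and omits the guiding information, compression, and on-disk archive format that the paper covers. The function name `add_reads`, the value `k=3`, and the example reads are all assumptions made for illustration.

```python
# Illustrative sketch only: a plain in-memory de Bruijn graph, NOT DARRC's
# guided graph or archive format. Names and parameters are hypothetical.

IUPAC = set("ACGTRYSWKMBDHVN")  # the 15 IUPAC nucleotide symbols


def add_reads(graph, reads, k):
    """Add every k-mer of each read as a node, linked to the k-mer that follows it.

    `graph` maps a k-mer to the set of its successor k-mers and is updated
    in place, so later calls extend the graph built by earlier calls.
    """
    for read in reads:
        read = read.upper()
        if not set(read) <= IUPAC:
            raise ValueError(f"non-IUPAC symbol in read: {read!r}")
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors = graph.setdefault(kmer, set())
            if i + k < len(read):
                # The next k-mer overlaps the current one by k-1 symbols.
                successors.add(read[i + 1:i + k + 1])
    return graph


# Build a graph from an initial set of reads.
graph = {}
add_reads(graph, ["ACGTAC", "CGTACG"], k=3)
nodes_before = len(graph)

# Later, add a new read (containing the ambiguity symbol N) to the same graph.
add_reads(graph, ["GTACGN"], k=3)
nodes_after = len(graph)
```

In this toy example the second call adds only one new node, because most k-mers of the new read are already present; highly similar genomes share most of their k-mers, which is the redundancy a pan-genome compressor can exploit.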

Acknowledgments

This research is funded by the International DFG Research Training Group GRK 1906/1 (GH and RW) and by the NSERC Discovery Frontiers grant on “Cancer Genome Collaboratory” (FH).

Author information

Authors and Affiliations

Genome Informatics, Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
Guillaume Holley, Roland Wittler & Jens Stoye
International Research Training Group 1906 “Computational Methods for the Analysis of the Diversity and Dynamics of Genomes”, Bielefeld University, Bielefeld, Germany
Guillaume Holley & Roland Wittler
School of Computing Science, Simon Fraser University, Burnaby, Canada
Faraz Hach
Department of Urologic Sciences, University of British Columbia, Vancouver, Canada
Faraz Hach
Vancouver Prostate Centre, Vancouver, Canada
Faraz Hach

Corresponding authors

Correspondence to Guillaume Holley or Faraz Hach.

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

Cite this paper

Holley, G., Wittler, R., Stoye, J., Hach, F. (2017). Dynamic Alignment-Free and Reference-Free Read Compression. In: Sahinalp, S. (ed.) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_4

DOI: https://doi.org/10.1007/978-3-319-56970-3_4
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science, Computer Science (R0)
