Abstract
The advent of High Throughput Sequencing (HTS) technologies raises a major concern about the storage and transmission of the data they produce. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences, ranging from tens to several thousands of genomes per species. Such collections of highly similar and redundant sequences are known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. In single-end mode, it improves the compression ratio by 30% over the best performing state-of-the-art HTS-specific compression method in our experiments.
Availability. DARRC is available at https://github.com/GuillaumeHolley/DARRC.
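To illustrate why a de Bruijn graph suits highly redundant read collections, the following minimal sketch builds a plain (unguided) de Bruijn graph from two reads. It is not DARRC's implementation: the guided graph, the archive format, and the incremental update are not shown, and the function names, the example reads, and the value of k are assumptions chosen for illustration only.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of its (k-1)-mer successors.

    Every distinct k-mer contributes exactly one edge, so a k-mer that
    occurs in many reads (or many genomes) is stored only once.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def edge_count(graph):
    """Number of edges, i.e. number of distinct k-mers in the input."""
    return sum(len(successors) for successors in graph.values())


# Hypothetical reads from two closely related genomes (one substitution).
read_a = "ACGTTGCATGCA"
read_b = "ACGTTGCTTGCA"
k = 4  # illustrative value, not a DARRC parameter

total_kmers = sum(len(r) - k + 1 for r in (read_a, read_b))
graph = build_de_bruijn([read_a, read_b], k)

print("k-mer occurrences in the reads:", total_kmers)
print("distinct k-mers kept in the graph:", edge_count(graph))
```

The graph keeps fewer k-mers than the reads contain because k-mers shared between the two reads collapse into a single edge. The same effect is what makes near-identical genomes inexpensive to add to an existing graph.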
Acknowledgments
This research is funded by the International DFG Research Training Group GRK 1906/1 for GH and RW, and by the NSERC Discovery Frontiers grant on “Cancer Genome Collaboratory” to FH.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Holley, G., Wittler, R., Stoye, J., Hach, F. (2017). Dynamic Alignment-Free and Reference-Free Read Compression. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science(), vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_4
DOI: https://doi.org/10.1007/978-3-319-56970-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science, Computer Science (R0)