DOI:10.1145/2808719.2808753

Fast and efficient compression of high-throughput sequencing reads

Published: 09 September 2015

Abstract

Biological sequence data for one to many individuals from thousands of species has been generated or is currently being generated, resulting in enormous amounts of next-generation sequencing (NGS) data that must be stored and managed. It has thus become imperative to develop tailored compression algorithms that exploit the particular redundancy present in NGS data in order to reduce storage costs. We present two LZ-style algorithms for short-read compression: Faust and Afin. Both methods work without a reference genome and take only the sequence reads as input. We compare our new techniques to state-of-the-art methods using a collection of one billion human sequence reads from a well-studied African male. Our experiments demonstrate that Afin and Faust provide powerful compression while using less time and memory than alternative methods during both compression and decompression. Both Faust and Afin are available at https://github.com/tkind94/afin---faust.
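
The abstract does not say how Faust or Afin parse their input, so the sketch below is only a generic illustration of what "LZ-style" compression means when applied to reads without a reference genome. It is not the authors' method. The function names (`lz_compress`, `lz_decompress`), the window size, the minimum match length, and the three toy reads are all assumptions made for this example.

```python
def lz_compress(text, window=4096, min_match=4):
    """Greedy LZ77-style parse (illustrative only, not Faust or Afin).

    Emits ('lit', char) for literals and ('copy', distance, length) for
    substrings that already occurred within the last `window` characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Naive quadratic search for the longest earlier match.
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i (overlapping copy).
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", text[i]))
            i += 1
    return tokens


def lz_decompress(tokens):
    """Invert lz_compress by replaying literals and back-references."""
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            # Copy one character at a time so overlapping copies work.
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)


# Hypothetical toy reads; real short reads overlap heavily in a similar way.
reads = ["ACGTACGTGA", "GTACGTGATT", "ACGTACGTGA"]
text = "\n".join(reads)
tokens = lz_compress(text)
restored = lz_decompress(tokens)
```

Because the reads overlap, later reads are mostly encoded as back-references to earlier ones, which is the kind of redundancy the abstract says NGS data exhibits. A practical compressor would replace the quadratic match search with an index and entropy-code the tokens.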


Published In

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
September 2015
683 pages
ISBN:9781450338530
DOI:10.1145/2808719

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data compression
  2. next generation sequencing technologies

Qualifiers

  • Research-article

Funding Sources

  • Colorado Clinical and Translational Sciences Institute
  • National Institutes of Health

Conference

BCB '15
Acceptance Rates

BCB '15 Paper Acceptance Rate 48 of 141 submissions, 34%;
Overall Acceptance Rate 254 of 885 submissions, 29%
