DOI: 10.1145/3569951.3597577
short-paper

FASTQ File Compression Benchmarking Using Lossless General Purpose Algorithms

Published: 10 September 2023

Abstract

FASTQ is a text-based format for storing a biological sequence (usually a nucleotide sequence) together with its corresponding quality scores, and it is widely used in genome sequencing. While most text-based formats compress well using traditional methods such as tar and gzip, FASTQ files are generally quite large and do not compress well using these methods, so much of a file system's space ends up being used to store these data sets. Because most computing platforms are shared resources, balancing compression against resource allocation is vital. This paper investigates the best general-purpose compression software for FASTQ files to run at the end of a job in a mixed-use throughput high-performance compute cluster. Of the fifty-seven methods tested in this paper, zpaq high compression delivers the highest compression ratios. However, for more real-world scenarios where system resources are shared or limited, we recommend pzstd medium as a good all-around compression method for FASTQ files: it delivers high compression ratios at fast speeds while remaining efficient in CPU and memory use.
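To make the trade-off between compression ratio and run time concrete, the following is a minimal sketch of such a measurement, not the benchmark harness used in the paper. It assumes Python's standard-library codecs (zlib, bz2, lzma) as stand-ins for the command-line tools the paper evaluates, and it runs on synthetic FASTQ records produced by a hypothetical `make_fastq` helper instead of real sequencing data. The function names, codec labels, and compression levels are illustrative assumptions. CPU and memory use, which the paper also considers, are not measured here.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build a synthetic FASTQ byte string (hypothetical data, not from the paper)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        # Each FASTQ record has four lines: header, sequence, separator, qualities.
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data, codecs):
    """Return {name: (compression_ratio, seconds)}, verifying each round trip."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Lossless check: decompressing must reproduce the input exactly.
        assert decompress(packed) == data
        results[name] = (len(data) / len(packed), elapsed)
    return results


# Stand-in codecs; the labels and levels are assumptions for illustration only.
CODECS = {
    "zlib-6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2-9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma-6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}

if __name__ == "__main__":
    data = make_fastq()
    for name, (ratio, secs) in benchmark(data, CODECS).items():
        print(f"{name:8s} ratio={ratio:5.2f} time={secs * 1000:7.1f} ms")
```

Compression ratio is computed here as original size divided by compressed size, so larger values mean better compression. Random synthetic reads are likely to behave differently from real sequencing output, so the numbers are only useful for comparing codecs against each other on the same input.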

Published In

PEARC '23: Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good
July 2023
519 pages
ISBN:9781450399852
DOI:10.1145/3569951
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Benchmark
  2. Bioinformatics
  3. Compression
  4. DNA Sequencing
  5. FASTQ
  6. File Format
  7. GNU
  8. Genomics
  9. Linux

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

PEARC '23

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%
