Abstract:
The rapid growth of next-generation sequencing (NGS) technology has led to an exponential increase in the volume of genomic data, creating significant challenges in data storage and transfer. Existing sequence data compression solutions often suffer from low throughput and moderate compression ratios, making them inadequate for large-scale genomic data management. We present GPUFASTQLZ, an ultra-fast compression methodology for FASTQ sequence data on GPUs. Leveraging the massive parallelism of GPUs, GPUFASTQLZ incorporates several optimizations, including a fast algorithm for field separation, a 2-bit encoding scheme for base fields, and the implementation of Illumina binning and GPULZ compression algorithms. We evaluate GPUFASTQLZ on three datasets across 324 hyperparameter settings. The results show that GPUFASTQLZ outperforms existing compressors, achieving up to a 1300x speedup in compression throughput and a 1.1x improvement in compression ratio over GZIP, and up to 18x higher throughput than the state-of-the-art FASTQ compressor GENOZIP.
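As a rough illustration of the 2-bit encoding idea for base fields, the sketch below packs four nucleotides into each byte on the CPU. It is not the GPUFASTQLZ implementation: the code assignment (A=0, C=1, G=2, T=3), the function names `pack_bases` and `unpack_bases`, and the choice to reject any symbol other than A, C, G, or T (such as N) are all assumptions made for this example.

```python
# Hypothetical sketch of 2-bit base packing; the mapping and names below are
# assumptions for illustration and are not taken from GPUFASTQLZ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        # Raises KeyError for any symbol outside A/C/G/T (e.g. 'N').
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_bases(packed: bytes, n: int) -> str:
    """Recover the first n bases from a buffer produced by pack_bases."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)
    )
```

Because each base's bits land at a fixed position that depends only on its index, every base (or every output byte) could in principle be handled by an independent thread, which is the property that makes this kind of encoding a natural fit for GPUs. The original length `n` has to be stored separately, since the final byte may be only partly filled.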
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
Date of Conference: 17-22 November 2024
Date Added to IEEE Xplore: 08 January 2025