GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data


Abstract:

The vast datasets produced in human genomics must be efficiently stored, transferred, and processed while prioritizing storage space and restore performance. Balancing these two properties becomes challenging when resorting to traditional data compression techniques. In fact, specialized algorithms for compressing sequencing data favor the former, while large genome repositories widely resort to generic compressors (e.g., GZIP) to benefit from the latter. Notably, human beings have approximately 99.9 percent of DNA sequence similarity, vouching for an excellent opportunity for deduplication and its assets: leveraging inter-file similarity and achieving higher read performance. However, identity-based deduplication fails to provide a satisfactory reduction in the storage requirements of genomes. In this article, we balance space savings and restore performance by proposing GenoDedup, the first method that integrates efficient similarity-based deduplication and specialized delta-encoding for genome sequencing data. Our solution currently achieves 67.8 percent of the reduction gains of SPRING (i.e., the best specialized tool in this metric) and restores data 1.62× faster than SeqDB (i.e., the fastest competitor). Additionally, GenoDedup restores data 9.96× faster than SPRING and compresses files 2.05× more than SeqDB.
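
The general idea of pairing similarity-based deduplication with delta-encoding can be illustrated with a toy sketch. This is not GenoDedup's actual algorithm, which the abstract does not detail. It assumes equal-length sequences, a small in-memory index searched by brute force, and Hamming distance as the similarity measure. The names `hamming`, `encode`, and `decode` are hypothetical.

```python
from typing import List, Tuple

# A delta is a list of (position, replacement base) substitutions.
Delta = List[Tuple[int, str]]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def encode(read: str, index: List[str]) -> Tuple[int, Delta]:
    """Return a pointer to the most similar indexed sequence plus the
    substitutions needed to turn that sequence into `read`.

    Assumption: brute-force search over a tiny index, for illustration only.
    """
    best = min(range(len(index)), key=lambda i: hamming(read, index[i]))
    base = index[best]
    delta = [(pos, r) for pos, (r, b) in enumerate(zip(read, base)) if r != b]
    return best, delta


def decode(pointer: int, delta: Delta, index: List[str]) -> str:
    """Restore the original read by applying the substitutions to the base."""
    bases = list(index[pointer])
    for pos, base in delta:
        bases[pos] = base
    return "".join(bases)


if __name__ == "__main__":
    index = ["ACGTACGT", "TTTTGGGG"]  # hypothetical base sequences
    read = "ACGTACCT"
    pointer, delta = encode(read, index)
    assert decode(pointer, delta, index) == read
    print(pointer, delta)
```

In this sketch, restoring a read needs only one index lookup and a few in-place substitutions, which gives an intuition for why deduplication-style schemes can favor restore speed. Identity-based deduplication would store the read in full unless it matched an indexed sequence exactly.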
Published in: IEEE Transactions on Computers (Volume: 70, Issue: 5, 01 May 2021)
Page(s): 669 - 681
Date of Publication: 14 May 2020
