Abstract:
Genomics data is being produced at an unprecedented rate, especially in the context of clinical applications and grand challenge questions. Genomics research involves various types of data, most of which are stored as plain-text tables. This paper introduces a data compression framework tailored to this file type, combining generic compression algorithms, GPU acceleration, and column-major storage. This approach is the first to achieve both compression and decompression rates of around 100 MB/s on commodity hardware without compromising compression ratio. By selecting an appropriate compression scheme for each column of data, the framework efficiently exploits data redundancy while remaining applicable to a wide range of formats. The GPU-accelerated implementation also exploits the parallelism inherent in the compression algorithms. Finally, this paper presents a novel transformation based on a first-order Markov model, with evidence that it is at least as effective as the Burrows-Wheeler and Move-To-Front transforms in some contexts.
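The idea of column-major storage with a compression scheme chosen per column can be illustrated with a minimal CPU-only sketch. This is not the paper's implementation: it assumes a tab-separated table with a fixed number of fields per row, uses the Python standard-library codecs `zlib`, `bz2`, and `lzma` as stand-ins for the "generic compression algorithms", and picks the codec that yields the smallest output for each column. The names `CODECS`, `compress_table`, and `decompress_table` are hypothetical, and neither the GPU acceleration nor the Markov-model transformation is shown.

```python
import bz2
import lzma
import zlib

# Candidate general-purpose codecs. This set is an assumption for
# illustration; the abstract does not say which algorithms the paper uses.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_table(text, sep="\t"):
    """Compress a plain-text table column by column.

    Returns a list of (codec_name, compressed_bytes), one entry per column.
    """
    rows = [line.split(sep) for line in text.splitlines()]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of fields")

    packed = []
    # zip(*rows) transposes the row-major table into columns.
    for column in zip(*rows):
        raw = "\n".join(column).encode("utf-8")
        # Try every codec on this column and keep the smallest result.
        name, blob = min(
            ((n, compress(raw)) for n, (compress, _) in CODECS.items()),
            key=lambda item: len(item[1]),
        )
        packed.append((name, blob))
    return packed


def decompress_table(packed, sep="\t"):
    """Invert compress_table and rebuild the row-major text."""
    columns = [
        CODECS[name][1](blob).decode("utf-8").split("\n")
        for name, blob in packed
    ]
    # Transpose the columns back into rows.
    return "\n".join(sep.join(fields) for fields in zip(*columns))
```

Because each column is compressed independently, the per-column work could in principle run in parallel, which is the property a GPU implementation would build on. Note that this sketch does not preserve a trailing newline in the input.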
Published in: 2013 IEEE International Conference on Big Data
Date of Conference: 06-09 October 2013
Date Added to IEEE Xplore: 23 December 2013
Electronic ISBN: 978-1-4799-1293-3