
Accelerating large-scale genomic analysis with Spark



Abstract:

High-throughput next-generation sequencing technologies are producing a flood of inexpensive genomic data, giving precision medicine the opportunity to better understand the primary causes of complex diseases such as cancer. However, even current state-of-the-art analysis approaches lag far behind the pace of data generation because of limited scalability, accuracy, and computational efficiency. To explore how to synthesize genomic data into knowledge efficiently and effectively, we propose GATK-Spark, a balanced parallelization approach that implements an in-memory version of GATK on Apache Spark. First, we perform a rigorous analysis of current GATK optimization strategies and identify three major scalability bottlenecks: poor compute-resource utilization, text-based data formats, and long-running single-threaded file splitting and merging operations. Second, we share our experience designing GATK-Spark, a new approach that optimizes GATK with the big-data computing framework Apache Spark, reducing the original 20-hour execution to 30 minutes, a speedup in excess of 37x on 256 CPU cores. This work will facilitate both the understanding of the genomics analytics pipeline and the design of strategies for accelerating large-scale genomic analysis applications.
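The abstract does not describe how GATK-Spark distributes work, so the following is only a minimal, hypothetical sketch of the general idea of region-level parallelism on Spark: reads are kept in memory, bucketed by genomic region so that work is balanced across cores, and each region is processed independently, avoiding single-threaded file splitting and merging. The Read and Variant types and the callVariants function are illustrative placeholders, not part of GATK or GATK-Spark.

// Illustrative sketch only; types and callVariants are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

case class Read(chrom: String, pos: Long, seq: String)
case class Variant(chrom: String, pos: Long, ref: String, alt: String)

object RegionParallelSketch {
  // Hypothetical per-region variant caller standing in for a GATK stage.
  def callVariants(reads: Iterable[Read]): Iterator[Variant] =
    reads.iterator.map(r => Variant(r.chrom, r.pos, "A", "C")) // placeholder logic

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("region-parallel-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Assume reads are already parsed into an RDD (ideally from a splittable,
    // non-text format); here a tiny in-memory sample for illustration.
    val reads = sc.parallelize(Seq(
      Read("chr1", 10100L, "ACGT"),
      Read("chr1", 10200L, "TTGA"),
      Read("chr2", 20050L, "GGCA")
    ))

    // Bucket reads into fixed-size regions, then call variants per region
    // in parallel -- all in memory, with no intermediate file cutting/merging.
    val regionSize = 1000000L
    val variants = reads
      .keyBy(r => (r.chrom, r.pos / regionSize))
      .groupByKey()
      .flatMap { case (_, regionReads) => callVariants(regionReads) }

    variants.collect().foreach(println)
    spark.stop()
  }
}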
Date of Conference: 15-18 December 2016
Date Added to IEEE Xplore: 19 January 2017
Conference Location: Shenzhen
