Abstract:
Genomic analysis [1] usually includes a pipeline of three stages: sequence alignment, data conversion, and advanced analysis. The analysis pipeline needs to handle hundre...Show MoreMetadata
Abstract:
Genomic analysis [1] usually includes a pipeline of three stages: sequence alignment, data conversion, and advanced analysis. The analysis pipeline needs to handle hundreds of gigabytes of data as well as to run complex analytics algorithms, which traditionally takes long execution time (20+ hours) for a full genomes analysis. Parallelizing the execution of analytics algorithms is one way to speed up the process. Parallelizing genomic analysis is not a simple task, however, as it involves complicated splitting/distribution of data and merging of intermediate results. Our objective is to reduce the genomic analysis time to under an hour. To achieve this, we designed and implemented a distributed analysis pipeline that executes the pipeline in parallel on a Hadoop cluster (physical machines or VM nodes). Since Hadoop already handles work/job dispatching and work balance among distributed worker nodes, we need not handle node failure and load balancing required with a traditional distributed computing approach. Our major challenge is to run the genomic analysis pipeline effectively with Hadoop MapReduce and to ensure the correctness and quality of the analysis results. This paper discusses our work in the design and implementation of a highly parallelized genomic analysis pipeline. Our preliminary experiment results show that our parallelized pipeline using MapReduce improves analysis time by 447% while maintaining the result quality.
Date of Conference: 29 October 2015 - 01 November 2015
Date Added to IEEE Xplore: 28 December 2015
ISBN Information: