ABSTRACT
In recent years, the cost of NGS (Next Generation Sequencing) technology has fallen dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. However, many components of the GATK pipeline do not parallelize well. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the Apache Spark big data framework. This implementation is highly scalable and parallelizes computation by exploiting data-level parallelism together with load balancing techniques. To reduce the analysis cost, the framework can run on nodes with as little as 16 GB of memory. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
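The abstract does not say which load balancing scheme is used, so the following is only a minimal sketch of one common approach, not the SparkGA implementation. It assumes the genome has already been split into regions with known read counts, and it greedily assigns the heaviest regions first to whichever task currently has the lightest load. The names `balance_regions` and `task_loads` and the read counts are hypothetical, chosen for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def balance_regions(read_counts: Dict[str, int], n_tasks: int) -> List[List[str]]:
    """Assign regions to tasks so that per-task read counts are roughly even.

    Illustrative greedy heuristic (heaviest region first, lightest task next);
    this is an assumption, not the scheme described in the paper.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Min-heap of (current load, task index): the lightest task is on top.
    heap: List[Tuple[int, int]] = [(0, i) for i in range(n_tasks)]
    heapq.heapify(heap)
    tasks: List[List[str]] = [[] for _ in range(n_tasks)]
    # Sort by descending read count; ties are broken by region name.
    ordered = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, count in ordered:
        load, idx = heapq.heappop(heap)
        tasks[idx].append(region)
        heapq.heappush(heap, (load + count, idx))
    return tasks


def task_loads(read_counts: Dict[str, int], tasks: List[List[str]]) -> List[int]:
    """Total number of reads assigned to each task."""
    return [sum(read_counts[region] for region in task) for task in tasks]


if __name__ == "__main__":
    # Hypothetical per-region read counts (arbitrary units).
    counts = {"chr1": 90, "chr2": 80, "chr3": 50, "chr4": 40, "chr5": 30, "chr6": 10}
    plan = balance_regions(counts, 3)
    print(plan)
    print(task_loads(counts, plan))
```

In a Spark setting, each resulting group of regions could be processed as one independent task, which is where the data-level parallelism would come from; how the actual framework partitions and schedules the data is described in the paper itself.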
Index Terms
- SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale