DOI: 10.1145/3107411.3107438
research-article

SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Published: 20 August 2017

ABSTRACT

In recent years, the cost of next-generation sequencing (NGS) technology has fallen dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to produce meaningful variant results. The GATK Best Practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. However, many components of the GATK pipeline do not parallelize well. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the Apache Spark big data framework. This implementation is highly scalable and parallelizes computation by exploiting data-level parallelism together with load-balancing techniques. To reduce the analysis cost, the framework can run on nodes with as little as 16 GB of memory. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.

        Published in

          ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
          August 2017
          800 pages
          ISBN: 9781450347228
          DOI: 10.1145/3107411

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          ACM-BCB '17 Paper Acceptance Rate: 42 of 132 submissions, 32%. Overall Acceptance Rate: 254 of 885 submissions, 29%.
