research-article

SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Authors:
Hamid Mushtaq (TU Delft, Delft, Netherlands)
Frank Liu (IBM, Austin, TX, USA)
Carlos Costa (IBM, Yorktown, NY, USA)
Gang Liu (IBM, Austin, TX, USA)
Peter Hofstee (IBM, Austin, TX, USA)
Zaid Al-Ars (TU Delft, Delft, Netherlands)

ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 2017, Pages 148–157. https://doi.org/10.1145/3107411.3107438

Published: 20 August 2017

ABSTRACT

In recent years, the cost of next-generation sequencing (NGS) has fallen dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. However, many components of the GATK pipeline do not parallelize well. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the Apache Spark big data framework. This implementation is highly scalable and parallelizes computation by exploiting data-level parallelism and load-balancing techniques. To reduce the analysis cost, the framework can run on nodes with as little as 16 GB of memory. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
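
As a rough illustration of the load-balancing idea mentioned in the abstract, the sketch below assigns genome regions to a fixed number of parallel tasks so that the per-task read counts come out roughly even. It is not taken from the SparkGA source: the function name `balance_regions`, the use of read counts as a proxy for processing cost, the greedy heaviest-first strategy, and the example counts are all assumptions made for illustration.

```python
import heapq


def balance_regions(region_read_counts, num_tasks):
    """Greedy heaviest-first assignment of regions to tasks.

    region_read_counts: dict mapping a region name to its read count
    (hypothetical input, used here as a proxy for processing cost).
    Returns one (total_reads, [region names]) tuple per task.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")

    # Min-heap of (current load, task index): the lightest task pops first.
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_tasks)]
    loads = [0] * num_tasks

    # Place the heaviest regions first; ties are broken by name so the
    # result is deterministic.
    ordered = sorted(region_read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, count in ordered:
        load, idx = heapq.heappop(heap)
        assignments[idx].append(region)
        loads[idx] = load + count
        heapq.heappush(heap, (loads[idx], idx))

    return list(zip(loads, assignments))


# Hypothetical read counts (in millions) for five regions, split over two tasks.
example_counts = {"chr1": 9, "chr2": 7, "chr3": 6, "chr4": 5, "chr5": 3}
example_plan = balance_regions(example_counts, num_tasks=2)
```

In a Spark setting, each resulting group could be processed as one partition. How SparkGA itself divides the work is not stated in the abstract.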

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Bioinformatics
    2. Genomics
2. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing

Published in
ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2017
800 pages
ISBN: 9781450347228
DOI: 10.1145/3107411
General Chairs: Nurit Haspel (University of Massachusetts Boston, USA), Lenore J. Cowen (Tufts University, USA)
Program Chairs: Amarda Shehu (George Mason University, USA), Tamer Kahveci (University of Florida, USA), Giuseppe Pozzi (Politecnico di Milano, Italy)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DNA
HADOOP
big data
bioinformatics
genomics
mapreduce
spark
Acceptance Rates
ACM-BCB '17 Paper Acceptance Rate: 42 of 132 submissions, 32%
Overall Acceptance Rate: 254 of 885 submissions, 29%
Article Metrics
- Total Citations: 20
- Total Downloads: 316
- Downloads (Last 12 months): 14
- Downloads (Last 6 weeks): 3