SPRITE: A Fast Parallel SNP Detection Pipeline

Rengasamy, Vasudevan; Madduri, Kamesh

doi:10.1007/978-3-319-41321-1_9


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Included in the following conference series:

International Conference on High Performance Computing


Abstract

We present Sprite, a new high-performance data analysis pipeline for detecting single nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read alignment, processing alignment output, and SNP identification. We target end-to-end scalability and I/O efficiency in Sprite by merging tools in this pipeline and eliminating redundancies. For a benchmark human whole-genome sequencing data set, Sprite takes less than 50 min on 16 nodes of the TACC Stampede supercomputer. A key component of our optimized pipeline is parsnip, a new parallel method and software tool for SNP detection. We find that the quality of results obtained by parsnip (sensitivity and precision using high-confidence variant calls as ground truth) is comparable to state-of-the-art SNP-calling software. A prototype implementation of Sprite is available at sprite-psu.sourceforge.net.
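The abstract evaluates parsnip by sensitivity and precision against high-confidence variant calls. As a minimal sketch of how such metrics are commonly computed, the Python snippet below compares a set of called SNPs with a truth set. The function name `sensitivity_precision`, the tuple representation of a variant, and the exact-match criterion are illustrative assumptions; the abstract does not state how the authors matched calls to the ground truth.

```python
from typing import Iterable, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def sensitivity_precision(called: Iterable[Variant],
                          truth: Iterable[Variant]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of `called` relative to `truth`.

    Assumes a call is correct only if it matches a truth variant exactly.
    """
    called_set = set(called)
    truth_set = set(truth)

    true_pos = len(called_set & truth_set)   # calls present in the truth set
    false_pos = len(called_set - truth_set)  # calls absent from the truth set
    false_neg = len(truth_set - called_set)  # truth variants that were missed

    # Sensitivity (recall): fraction of truth variants that were called.
    sensitivity = true_pos / (true_pos + false_neg) if truth_set else 0.0
    # Precision: fraction of calls that are in the truth set.
    precision = true_pos / (true_pos + false_pos) if called_set else 0.0
    return sensitivity, precision


# Toy data, invented for illustration only.
truth_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
    ("chr2", 75, "T", "C"),
]
pipeline_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
    ("chr3", 10, "A", "C"),  # not in the truth set: a false positive
]

sens, prec = sensitivity_precision(pipeline_calls, truth_calls)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy example three of the four truth variants are recovered and one call is spurious, so both metrics come out to 0.75. Real evaluations usually restrict the comparison to high-confidence genomic regions, which this sketch does not model.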

Acknowledgments

This research is supported by the National Science Foundation award #1439057. We thank members of our project research team for helpful discussions.

Author information

Authors and Affiliations

The Pennsylvania State University, University Park, PA, USA
Vasudevan Rengasamy & Kamesh Madduri

Corresponding author

Correspondence to Vasudevan Rengasamy.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum, Hamburg, Germany
Julian M. Kunkel
Argonne National Laboratory, Lemont, Illinois, USA
Pavan Balaji
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra

About this paper

Cite this paper

Rengasamy, V., Madduri, K. (2016). SPRITE: A Fast Parallel SNP Detection Pipeline. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_9

DOI: https://doi.org/10.1007/978-3-319-41321-1_9
Published: 15 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science, Computer Science (R0)
