SPRITE: A Fast Parallel SNP Detection Pipeline

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Abstract

We present Sprite, a new high-performance data analysis pipeline for detecting single nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read alignment, processing alignment output, and SNP identification. We target end-to-end scalability and I/O efficiency in Sprite by merging tools in this pipeline and eliminating redundancies. For a benchmark human whole-genome sequencing data set, Sprite takes less than 50 min on 16 nodes of the TACC Stampede supercomputer. A key component of our optimized pipeline is parsnip, a new parallel method and software tool for SNP detection. We find that the quality of results obtained by parsnip (sensitivity and precision using high-confidence variant calls as ground truth) is comparable to state-of-the-art SNP-calling software. A prototype implementation of Sprite is available at sprite-psu.sourceforge.net.
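
The two quality measures named in the abstract, sensitivity and precision against a ground-truth call set, can be illustrated with a minimal sketch. This is not the evaluation code used for parsnip; the `Snp` record, the function name `sensitivity_precision`, and the toy call sets are hypothetical and serve only to show how the two ratios are formed when a call counts as correct only if it matches a truth call exactly.

```python
from typing import NamedTuple, Set, Tuple


class Snp(NamedTuple):
    """Hypothetical SNP record: one substitution at one reference position."""
    chrom: str
    pos: int
    ref: str
    alt: str


def sensitivity_precision(called: Set[Snp], truth: Set[Snp]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of `called` relative to `truth`.

    Assumption: a call is a true positive only if chromosome, position,
    reference allele, and alternate allele all match a truth call.
    """
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy data, invented for illustration only.
truth = {
    Snp("chr1", 100, "A", "G"),
    Snp("chr1", 250, "C", "T"),
    Snp("chr2", 40, "G", "A"),
    Snp("chr2", 90, "T", "C"),
}
called = {
    Snp("chr1", 100, "A", "G"),
    Snp("chr1", 250, "C", "T"),
    Snp("chr2", 40, "G", "A"),
    Snp("chr3", 7, "A", "C"),   # not in the truth set: a false positive
}

result = sensitivity_precision(called, truth)
print(result)
```

In this toy case three of the four truth calls are recovered and one of the four reported calls is spurious, so both ratios are 3/4.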

Acknowledgments

This research is supported by National Science Foundation award #1439057. We thank members of our project research team for helpful discussions.

Author information

Corresponding author

Correspondence to Vasudevan Rengasamy.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Rengasamy, V., Madduri, K. (2016). SPRITE: A Fast Parallel SNP Detection Pipeline. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_9

  • DOI: https://doi.org/10.1007/978-3-319-41321-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41320-4

  • Online ISBN: 978-3-319-41321-1

  • eBook Packages: Computer Science, Computer Science (R0)
