MULKSG: MULtiple K Simultaneous Graph Assembly

  • Conference paper
Algorithms for Computational Biology (AlCoB 2019)

Abstract

This work shows how to build and process the de Bruijn graphs for multiple values of K simultaneously during genome assembly, removing the bottleneck of iterative multi-K assembly. Execution time on a single node with 40 cores varies by dataset: averaged over the 16 datasets tested, the entire pipeline takes 1613 s with SPAdes vs. 1581 s with MULKSG (about 2% less), and MULKSG graph creation and traversal average 15% faster than in SPAdes. We also provide a multi-node implementation of the graph creation and traversal portions of the assembly, with the speedups shown in Fig. 4. We show that when the method is implemented correctly, with correction phases performed per graph in parallel, the outcome is very close to that of the original method, in some cases with fewer errors at the same NGA50 and genome coverage percentage. We show that this works in practice by implementing it in the popular genome assembler SPAdes. Further, this algorithmic change removes the single-node sequential bottleneck of multi-K genome assembly, allowing error correction, graph building, graph correction, and graph traversal to run in parallel. We implement a parallel version of the assembly and show that its statistics are the same as those of a single-node run. The code is open source and can be found at https://github.com/cwright7101/mulksg.
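The core idea, that the graph for each value of K can be built independently rather than one after another, can be illustrated with a minimal Python sketch. This is not the MULKSG or SPAdes implementation: the function names `build_de_bruijn` and `build_all_graphs`, the toy reads, and the use of a thread pool are all assumptions made for illustration. Threads are used only to keep the example self-contained; a real assembler would more plausibly distribute the per-K work across processes or nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph for one k (hypothetical helper).

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (k-1)-mer to its suffix (k-1)-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def build_all_graphs(reads, k_values, max_workers=None):
    """Build one graph per k concurrently (hypothetical helper).

    No graph depends on the result of another, so every k is submitted
    at once instead of being processed in sequence.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {k: pool.submit(build_de_bruijn, reads, k) for k in k_values}
        return {k: future.result() for k, future in futures.items()}


# Toy reads, invented for this example only.
reads = ["ACGTACGTGA", "CGTACGTGAC"]
graphs = build_all_graphs(reads, k_values=[3, 4, 5])
```

Because each call to `build_de_bruijn` reads only the shared input and writes only its own graph, the concurrent result for a given K is identical to what a sequential loop over K would produce. Per-graph correction and traversal phases could in principle be scheduled in the same way.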

Author information

Corresponding author

Correspondence to Christopher Wright.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wright, C., Krishnamoorty, S., Kulkarni, M. (2019). MULKSG: MULtiple K Simultaneous Graph Assembly. In: Holmes, I., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science, vol 11488. Springer, Cham. https://doi.org/10.1007/978-3-030-18174-1_9

  • DOI: https://doi.org/10.1007/978-3-030-18174-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18173-4

  • Online ISBN: 978-3-030-18174-1

  • eBook Packages: Computer Science, Computer Science (R0)
