Abstract
This work shows how to build and process the de Bruijn graphs for multiple values of K simultaneously during genome assembly, removing the bottleneck of iterative multi-K assembly. Execution time on a single node with 40 cores varies; averaged over the 16 datasets tested, the entire pipeline takes 1613 s with SPAdes vs. 1581 s with MULKSG, and MULKSG graph creation and traversal average 15% faster than in SPAdes. We also provide a multi-node implementation of the graph creation and traversal portions of the assembly; its speedups are shown in Fig. 4. When the correction phases are performed per graph in parallel, the results are very close to those of the original method, in some cases with fewer errors at the same NGA50 and genome coverage percentage. We demonstrate that this works in practice by implementing it with the popular genome assembler SPAdes. Further, this algorithmic change removes the single-node sequential bottleneck in multi-K genome assembly, allowing parallel error correction, graph building, graph correction, and graph traversal. We implement a parallel version of the assembly and show that its statistics match those of a single-node run. The code is open source and can be found at https://github.com/cwright7101/mulksg.
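The core idea of the abstract, building one de Bruijn graph per value of K at the same time rather than one after another, can be illustrated with a minimal sketch. This is not the MULKSG or SPAdes code: the function names `build_de_bruijn` and `build_all_graphs`, the toy reads, and the use of a thread pool are all assumptions made for illustration, and error correction, graph correction, and traversal are omitted.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def build_all_graphs(reads, ks, max_workers=None):
    """Build one graph per k concurrently (hypothetical helper).

    No graph depends on the result of another k, so all tasks can be
    submitted at once instead of being chained iteratively.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {k: pool.submit(build_de_bruijn, reads, k) for k in ks}
        return {k: future.result() for k, future in futures.items()}


# Toy input, invented for this example.
reads = ["ACGTAC", "CGTACG"]
graphs = build_all_graphs(reads, ks=[3, 4])
```

A thread pool is used here only to keep the sketch short and portable; in CPython it will not give a real speedup for this CPU-bound work, so a process pool or a multi-node scheme would be the natural substitute when actual parallelism is needed.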
References
https://ccb.jhu.edu/gage_b/datasets/A_hydrophila_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Aeromonas_hydrophila_ATCC_7966_uid58617/NC_008570.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/B_cereus_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Bacillus_cereus_ATCC_10987_uid57673/NC_003909.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/B_cereus_MiSeq.tar.gz. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/B_fragilis_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Bacteroides_fragilis_638R_uid84217/NC_016776.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/M_abscessus_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Mycobacterium_abscessus_uid61613/NC_010394.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/M_abscessus_MiSeq.tar.gz. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/R_sphaeroides_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Rhodobacter_sphaeroides_2_4_1_uid57653/NC_007488.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/R_sphaeroides_MiSeq.tar.gz. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/S_aureus_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Staphylococcus_aureus_USA300_TCH1516_uid58925/NC_010063.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/V_cholerae_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Vibrio_cholerae_O1_biovar_El_Tor_N16961_uid57623/NC_002505.fna. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/V_cholerae_MiSeq.tar.gz. Accessed 7 Jan 2019
https://ccb.jhu.edu/gage_b/datasets/X_axonopodis_HiSeq.tar.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Xanthomonas_axonopodis_citrumelo_F1_uid73179/NC_016010.fna. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000299455.1_ASM29945v1/GCF_000299455.1_ASM29945v1_genomic.fna.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Pseudomonas_aeruginosa/reference/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz. Accessed 7 Jan 2019
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Staphylococcus_aureus/reference/GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.fna.gz. Accessed 7 Jan 2019
Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
Batzoglou, S., et al.: ARACHNE: a whole-genome shotgun assembler. Genome Res. 12(1), 177–89 (2002)
Dalcin, L.D., Paz, R.R., Kler, P.A., Cosimo, A.: Parallel distributed computing using Python. Adv. Water Resour. 34(9), 1124–1139 (2011). https://doi.org/10.1016/j.advwatres.2011.04.013. http://www.sciencedirect.com/science/article/pii/S0309170811000777
Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013). https://doi.org/10.1093/bioinformatics/btt086. https://www.ncbi.nlm.nih.gov/pubmed/23422339
Huang, X., Madan, A.: CAP3: a DNA sequence assembly program. Genome Res. 9(9), 868–877 (1999). https://www.ncbi.nlm.nih.gov/pubmed/10508846
Li, D., Liu, C.M., Luo, R., Sadakane, K., Lam, T.W.: MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10), 1674–1676 (2015). https://doi.org/10.1093/bioinformatics/btv033
Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1(1), 18 (2012)
Magoc, T., et al.: GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 29(14), 1718–1725 (2013). https://doi.org/10.1093/bioinformatics/btt273. https://www.ncbi.nlm.nih.gov/pubmed/23665771
Mahadik, K., Wright, C., Kulkarni, M., Bagchi, S., Chaterji, S.: Scalable genomic assembly through parallel de Bruijn graph construction for multiple K-mers. In: BCB (2017)
Mullikin, J.C., Ning, Z.: The phusion assembler. Genome Res. 13(1), 81–90 (2003). https://doi.org/10.1101/gr.731003. https://www.ncbi.nlm.nih.gov/pubmed/12529309
Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wright, C., Krishnamoorty, S., Kulkarni, M. (2019). MULKSG: MULtiple K Simultaneous Graph Assembly. In: Holmes, I., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science(), vol 11488. Springer, Cham. https://doi.org/10.1007/978-3-030-18174-1_9
DOI: https://doi.org/10.1007/978-3-030-18174-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18173-4
Online ISBN: 978-3-030-18174-1
eBook Packages: Computer Science, Computer Science (R0)