Metagenomics Binning of Long Reads Using Read-Overlap Graphs

  • Conference paper
  • Comparative Genomics (RECOMB-CG 2022)

Abstract

Metagenomics sequencing enables the direct study of microbial communities, revealing important information such as the taxonomy and relative abundance of species. Metagenomics binning facilitates the separation of this genetic material into different taxonomic groups. Moving from second-generation to third-generation sequencing techniques enables the binning of reads before assembly, thanks to the increased read lengths. The few existing long-read binning tools still suffer from unreliable coverage estimation for individual long reads and face challenges in recovering low-abundance species. In this paper, we present a novel approach to bin long reads using the read-overlap graph. The read-overlap graph (1) enables a fast and reliable estimation of the coverage of individual long reads; (2) allows the overlap information between reads to be incorporated into the binning process; and (3) facilitates a more uniform sampling of long reads across species of varying abundances. Experimental results show that our approach produces better binning results for long reads and leads to better assemblies, especially when recovering low-abundance species. The source code and a functional Google Colab Notebook are available at https://www.github.com/anuradhawick/oblr.

References

  1. Baaijens, J.A., El Aabidine, A.Z., Rivals, E., Schönhuth, A.: De novo assembly of viral quasispecies using overlap graphs. Genome Res. 27(5), 835–848 (2017)

  2. Balvert, M., Luo, X., Hauptfeld, E., Schönhuth, A., Dutilh, B.E.: Ogre: overlap graph-based metagenomic read clustering. Bioinformatics 37(7), 905–912 (2021)

  3. Chen, K., Pachter, L.: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLOS Comput. Biol. 1(2) (2005)

  4. Feng, X., Cheng, H., Portik, D., Li, H.: Metagenome assembly of high-fidelity long reads with hifiasm-meta. arXiv:2110.08457 (2021)

  5. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)

  6. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035 (2017)

  7. Huson, D.H., et al.: Megan-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol. Direct 13(1), 1–17 (2018)

  8. Huson, D.H., Richter, D.C., Mitra, S., Auch, A.F., Schuster, S.C.: Methods for comparative metagenomics. BMC Bioinf. 10(1), 1–10 (2009)

  9. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Analysis 6(5), 429–449 (2002)

  10. Kang, D.D., et al.: Metabat 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019)

  11. Kim, D., Song, L., Breitwieser, F.P., Salzberg, S.L.: Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26(12), 1721–1729 (2016)

  12. Kolmogorov, M., et al.: metaflye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17(11), 1103–1110 (2020)

  13. Li, H.: Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32(14), 2103–2110 (2016)

  14. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)

  15. Liang, D.M., Li, Y.F.: Lightweight label propagation for large-scale network data. In: IJCAI, pp. 3421–3427 (2018)

  16. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 39(2), 539–550 (2009). https://doi.org/10.1109/TSMCB.2008.2007853

  17. Logsdon, G.A., Vollger, M.R., Eichler, E.E.: Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21(10), 597–614 (2020)

  18. McInnes, L., Healy, J., Astels, S.: HDBSCAN: hierarchical density based clustering. J. Open Source Softw. 2(11), 205 (2017)

  19. McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction (2020)

  20. Menzel, P., Ng, K.L., Krogh, A.: Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016)

  21. Meyer, F., et al.: Amber: assessment of metagenome binners. Gigascience 7(6), giy069 (2018)

  22. Mikheenko, A., Saveliev, V., Gurevich, A.: Metaquast: evaluation of metagenome assemblies. Bioinformatics 32(7), 1088–1090 (2016)

  23. Nayfach, S., Pollard, K.S.: Toward accurate and quantitative comparative metagenomics. Cell 166(5), 1103–1116 (2016)

  24. Nicholls, S.M., Quick, J.C., Tang, S., Loman, N.J.: Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience 8(5), giz043 (2019)

  25. Nissen, J.N., et al.: Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39(5), 555–560 (2021)

  26. Nolet, C.J., et al.: Bringing UMAP closer to the speed of light with GPU acceleration (2020)

  27. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  28. Ruan, J., Li, H.: Fast and accurate long-read assembly with WTDBG2. Nat. Methods 17(2), 155–158 (2020)

  29. Stöcker, B.K., Köster, J., Rahmann, S.: Simlord: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)

  30. Strous, M., Kraft, B., Bisdorf, R., Tegetmeyer, H.: The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410 (2012)

  31. Team, R.D.: RAPIDS: Collection of Libraries for End to End GPU Data Science (2018). https://rapids.ai

  32. Tyson, G.W., et al.: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428(6978), 37–43 (2004)

  33. Wickramarachchi, A.: anuradhawick/seq2vec: release v1.0 (2021). https://doi.org/10.5281/zenodo.5515743

  34. Wickramarachchi, A., Lin, Y.: Lrbinner: binning long reads in metagenomics datasets. In: 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2021)

  35. Wickramarachchi, A., Mallawaarachchi, V., Rajan, V., Lin, Y.: Metabcc-LR: meta genomics binning by coverage and composition for long reads. Bioinformatics 36(Supplement_1), i3–i11 (2020)

  36. Wood, D.E., Lu, J., Langmead, B.: Improved metagenomic analysis with kraken 2. Genome Biol. 20(1), 1–13 (2019)

  37. Wu, Y.W., Simmons, B.A., Singer, S.W.: Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32(4), 605–607 (2016)

  38. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? arXiv:1810.00826 (2018)

Author information

Corresponding author

Correspondence to Yu Lin.

Appendices

A Dataset Information

Tables 5, 6 and 7 summarize the simulated and real datasets. Tables 5 and 6 list the coverages used for simulation with SimLoRD [29], while Table 7 lists the abundances reported by the dataset sources.

Table 5. Information of simulated datasets.
Table 6. Information of simulated dataset containing 50 species.
Table 7. Information of real datasets.

B Interpretation of AMBER Per-bin F1-Score

The binning results are evaluated using precision, recall and F1-score. Stricter evaluations using AMBER [21] are also presented for MetaBCC-LR, LRBinner and OBLR. This section explains the evaluation metrics in detail and discusses why the AMBER scores are poor in some cases where the number of predicted bins deviates from the actual number of species in the dataset. The bin assignment matrix \(a\) has dimensions \(M\times N\), as illustrated in Table 8, where \(N=5\) and \(M=7\).

Table 8. Binning matrix

Recall is computed per species, from the bin that receives the largest share of that species' reads. Precision is computed per bin, from the species that contributes the largest share of that bin's reads. In contrast, AMBER uses purity and completeness to compute the per-bin F1 score using the following equations, for each bin b.

The true positives are computed using the majority species in a given bin. Because of this, if a bin arises from a false bin split (holding, say, 1% of a species' reads), the completeness of that bin will be very low (approximately 1%) under the AMBER evaluation. In comparison, the recall of the species using Eq. 5 will be 99%, since 99% of its reads are in a single bin despite the false split. Similarly, according to Eq. 4, the falsely split bin will still report a high precision as long as it contains no reads from other species. Consider the following running example.

Example 1. Suppose Species 1 has \(a_{11}=99\) and \(a_{16}=1\), with the rest of the row having no reads, and Bins 1 and 6 contain no reads from any other species. Purity is then 100% for both Bins 1 and 6, while completeness is 99% and 1% respectively. The per-bin F1-scores are 99.5% and 1.98%, giving a very low average of 50.7%. In contrast, recall is 99% for Species 1, with 100% precision for both Bins 1 and 6 since neither bin contains impurities; thus the F1-score is 99.5% for each bin.
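The arithmetic in Example 1 can be checked with a short Python sketch. It assumes the usual definitions: purity is the majority-species reads in a bin divided by all reads in that bin, and completeness is the majority-species reads in a bin divided by all reads of that species. The variable and function names are illustrative and are not taken from AMBER.

```python
def f1(p, r):
    """Harmonic mean of two rates in [0, 1]."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Example 1: Species 1 has 99 reads in Bin 1 and 1 read in Bin 6.
# Neither bin contains reads from any other species.
species_reads = {1: 99, 6: 1}          # bin id -> reads of Species 1 in that bin
bin_sizes = {1: 99, 6: 1}              # bin id -> total reads in that bin
species_total = sum(species_reads.values())

# Purity/completeness style per-bin F1 (assumed definitions, see lead-in).
per_bin_f1 = {}
for b, tp in species_reads.items():
    purity = tp / bin_sizes[b]
    completeness = tp / species_total
    per_bin_f1[b] = f1(purity, completeness)
average_f1 = sum(per_bin_f1.values()) / len(per_bin_f1)

# Precision/recall view: recall uses the largest bin of the species,
# precision is 1.0 because both bins are free of contaminants.
recall = max(species_reads.values()) / species_total
precision = 1.0
pr_f1 = f1(precision, recall)

print(f"per-bin F1: {per_bin_f1[1]:.3%}, {per_bin_f1[6]:.3%}")
print(f"average per-bin F1: {average_f1:.3%}")
print(f"precision/recall F1: {pr_f1:.3%}")
```

The single misplaced read creates a second bin whose near-zero F1 halves the average, even though 99% of the reads are binned correctly.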

Example 2. Suppose Bin 2 contains only species 2 and 3, with \(a_{22}=100\) and \(a_{32}=20\), and both species are fully contained in the bin. The purity of the bin is 83.33% and completeness is 83.33%; hence the F1-score is 83.33%. Recall for species 2 and 3 is 100%, since neither is split across multiple bins. However, the precision for Bin 2 is 83.33%, giving an F1-score of 90.91%.
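The precision and recall figures in Example 2 follow directly from the verbal definitions given earlier (per-bin precision from the largest species in the bin, per-species recall from the largest bin of the species). The sketch below is only an illustration: the dictionary layout keyed by (species, bin) and the function names are assumptions, not part of the original evaluation code.

```python
from collections import defaultdict

def bin_precision(a):
    """Per-bin precision: largest single-species count / total reads in the bin."""
    totals, largest = defaultdict(int), defaultdict(int)
    for (species, b), n in a.items():
        totals[b] += n
        largest[b] = max(largest[b], n)
    return {b: largest[b] / totals[b] for b in totals}

def species_recall(a):
    """Per-species recall: largest single-bin count / total reads of the species."""
    totals, largest = defaultdict(int), defaultdict(int)
    for (species, b), n in a.items():
        totals[species] += n
        largest[species] = max(largest[species], n)
    return {s: largest[s] / totals[s] for s in totals}

# Example 2: Bin 2 holds all 100 reads of species 2 and all 20 reads of species 3.
a = {(2, 2): 100, (3, 2): 20}

precision = bin_precision(a)   # keyed by bin id
recall = species_recall(a)     # keyed by species id

p, r = precision[2], recall[2]
f1_score = 2 * p * r / (p + r)
print(f"precision(Bin 2) = {p:.2%}, recall(species 2) = {r:.2%}, F1 = {f1_score:.2%}")
```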

This means that AMBER penalizes a species being split across bins, while not significantly penalizing mergers between a large bin and a smaller one. This is because the dominant species in a bin determines its purity and completeness.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wickramarachchi, A., Lin, Y. (2022). Metagenomics Binning of Long Reads Using Read-Overlap Graphs. In: Jin, L., Durand, D. (eds) Comparative Genomics. RECOMB-CG 2022. Lecture Notes in Computer Science, vol 13234. Springer, Cham. https://doi.org/10.1007/978-3-031-06220-9_15

  • DOI: https://doi.org/10.1007/978-3-031-06220-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06219-3

  • Online ISBN: 978-3-031-06220-9

  • eBook Packages: Computer Science, Computer Science (R0)
