Metagenomics Binning of Long Reads Using Read-Overlap Graphs

Wickramarachchi, Anuradha; Lin, Yu

doi:10.1007/978-3-031-06220-9_15

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 13234))

Included in the following conference series:

RECOMB International Workshop on Comparative Genomics

825 Accesses
2 Citations
3 Altmetric

Abstract

Metagenomics sequencing enables the direct study of microbial communities revealing important information such as taxonomy and relative abundance of species. Metagenomics binning facilitates the separation of these genetic materials into different taxonomic groups. Moving from second-generation sequencing to third-generation sequencing techniques enables the binning of reads before assembly thanks to the increased read lengths. The limited number of long-read binning tools that exist, still suffer from unreliable coverage estimation for individual long reads and face challenges in recovering low-abundance species. In this paper, we present a novel binning approach to bin long reads using the read-overlap graph. The read-overlap graph (1) enables a fast and reliable estimation of the coverage of individual long reads; (2) allows to incorporate the overlapping information between reads into the binning process; (3) facilitates a more uniform sampling of long reads across species of varying abundances. Experimental results show that our new binning approach produces better binning results of long reads and results in better assemblies especially for recovering low abundant species. The source code and a functional Google Colab Notebook are available at https://www.github.com/anuradhawick/oblr.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 69.99; Price excludes VAT (USA)

Softcover Book: USD 89.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Baaijens, J.A., El Aabidine, A.Z., Rivals, E., Schönhuth, A.: De novo assembly of viral quasispecies using overlap graphs. Genome Res. 27(5), 835–848 (2017)
Article Google Scholar
Balvert, M., Luo, X., Hauptfeld, E., Schönhuth, A., Dutilh, B.E.: Ogre: overlap graph-based metagenomic read clustering. Bioinformatics 37(7), 905–912 (2021)
Article Google Scholar
Chen, K., Pachter, L.: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLOS Comput. Biol. 1(2) (2005)
Google Scholar
Feng, X., Cheng, H., Portik, D., Li, H.: Metagenome assembly of high-fidelity long reads with hifiasm-meta. arXiv:2110.08457 (2021)
Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
Google Scholar
Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035 (2017)
Google Scholar
Huson, D.H., et al.: Megan-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol. Direct 13(1), 1–17 (2018)
Google Scholar
Huson, D.H., Richter, D.C., Mitra, S., Auch, A.F., Schuster, S.C.: Methods for comparative metagenomics. BMC Bioinf. 10(1), 1–10 (2009)
Google Scholar
Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Analysis 6(5), 429–449 (2002)
Article Google Scholar
Kang, D.D., et a.: Metabat 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019)
Google Scholar
Kim, D., Song, L., Breitwieser, F.P., Salzberg, S.L.: Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26(12), 1721–1729 (2016)
Article Google Scholar
Kolmogorov, M., et al.: metaflye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17(11), 1103–1110 (2020)
Google Scholar
Li, H.: Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32(14), 2103–2110 (2016)
Article Google Scholar
Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
Article Google Scholar
Liang, D.M., Li, Y.F.: Lightweight label propagation for large-scale network data. In: IJCAI, pp. 3421–3427 (2018)
Google Scholar
Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 39(2), 539–550 (2009). https://doi.org/10.1109/TSMCB.2008.2007853
Logsdon, G.A., Vollger, M.R., Eichler, E.E.: Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21(10), 597–614 (2020)
Article Google Scholar
McInnes, L., Healy, J., Astels, S.: HDBSCAN: hierarchical density based clustering. J. Open Source Softw. 2(11), 205, e7359 (2017)
Google Scholar
McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction (2020)
Google Scholar
Menzel, P., Ng, K.L., Krogh, A.: Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016)
Google Scholar
Meyer, F., et al.: Amber: assessment of metagenome binners. Gigascience 7(6), giy069 (2018)
Google Scholar
Mikheenko, A., Saveliev, V., Gurevich, A.: Metaquast: evaluation of metagenome assemblies. Bioinformatics 32(7), 1088–1090 (2016)
Article Google Scholar
Nayfach, S., Pollard, K.S.: Toward accurate and quantitative comparative metagenomics. Cell 166(5), 1103–1116 (2016)
Article Google Scholar
Nicholls, S.M., Quick, J.C., Tang, S., Loman, N.J.: Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience 8(5), giz043 (2019)
Google Scholar
Nissen, J.N., et al.: Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39(5), 555–560 (2021)
Google Scholar
Nolet, C.J., et al.: Bringing UMAP closer to the speed of light with GPU acceleration (2020)
Google Scholar
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65, e7359 (1987)
Google Scholar
Ruan, J., Li, H.: Fast and accurate long-read assembly with WTDBG2. Nat. Methods 17(2), 155–158, e7359 (2020)
Google Scholar
Stöcker, B.K., Köster, J., Rahmann, S.: Simlord: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)
Article Google Scholar
Strous, M., Kraft, B., Bisdorf, R., Tegetmeyer, H.: The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410 (2012)
Article Google Scholar
Team, R.D.: RAPIDS: Collection of Libraries for End to End GPU Data Science (2018). https://rapids.ai
Tyson, G.W., et al.: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428(6978), 37–43 (2004)
Google Scholar
Wickramarachchi, A.: anuradhawick/seq2vec: release v1.0 (2021). https://doi.org/10.5281/zenodo.5515743, https://doi.org/10.5281/zenodo.5515743
Wickramarachchi, A., Lin, Y.: Lrbinner: binning long reads in metagenomics datasets. In: 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2021)
Google Scholar
Wickramarachchi, A., Mallawaarachchi, V., Rajan, V., Lin, Y.: Metabcc-LR: meta genomics binning by coverage and composition for long reads. Bioinformatics 36(Supplement_1), i3–i11 (2020)
Google Scholar
Wood, D.E., Lu, J., Langmead, B.: Improved metagenomic analysis with kraken 2. Genome Biol. 20(1), 1–13 (2019)
Article Google Scholar
Wu, Y.W., Simmons, B.A., Singer, S.W.: Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32(4), 605–607 (2016)
Google Scholar
Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? arXiv:1810.00826 (2018)

Download references

Author information

Authors and Affiliations

School of Computing, Australian National University, Canberra, Australia
Anuradha Wickramarachchi & Yu Lin

Authors

Anuradha Wickramarachchi
View author publications
You can also search for this author in PubMed Google Scholar
Yu Lin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yu Lin .

Editor information

Editors and Affiliations

University of Saskatchewan, Saskatoon, SK, Canada
Lingling Jin
Carnegie Mellon University, Pittsburgh, PA, USA
Dannie Durand

Appendices

A Dataset Information

Tables 5 and 7 demonstrate the simulated and real dataset information respectively. Note that Table 5 and 6 tabulate the coverages used for simulation using SimLoRD [29] while Table 7 indicate abundances from the dataset sources.

Table 5. Information of simulated datasets.

Full size table

Table 6. Information of simulated dataset containing 50 species.

Full size table

Table 7. Information of real datasets.

Full size table

B Interpretation of AMBER Per-bin F1-Score

The binning evaluations are presented using Precision, Recall and F1 score. Furthermore, stricter evaluations are presented using AMBER [21] for MetaBCC-LR, LRBinner and OBLR. This section explains the evaluation metrics in detail and discuss as to why AMBER evaluations are poor in some cases where number of bins predicted is further away from the actual number of species in the dataset. Note that the bin assignment matrix a can be presented as \(M\times N\), illustrated in Table 8. Note that \(N=5\) and \(M=7\).

Table 8. Binning matrix

Full size table

Recall is computed for each species, by taking the largest assignment to a bin. Precision is computed per bin taking the largest assignment of the bin to a given species. In contrast, AMBER uses purity and completeness to compute the per-bin F1 score using the following equations, for each bin b.

The true positives are computed using the majority species in a given bin. Because of this, if a bin appears as a result of a false bin split (1% reads), the Completeness of the bin will be very low as the majority of it (approximately 1%) according to AMBER evaluation. In comparison, the recall of the species using Eq. 5 will report 99% since 99% of the reads are in a single bin despite having the false bin split. Similarly, the false split of the bin will report a greater precision as long as the bin has no other species mixed according to Eq. 4. Consider the following running example.

Example 1. Suppose Species 1 has \(a_{11}=99\) and \(a_{16}=1\) with rest of the row having no reads and Bin 1 and 6 has no reads from another species. Purity in this case will be 100% for both bins 1 and 6 while completeness will be 99% and 1% respectively. F1-score will be 99.5% and 1.98% with average being very low at 50.7%. Recall will be 99% for Species 1 with 100% precision on both bins 1 and 6 since there are no impurities in each bin, thus, F1-score is 99.5% for each bin.

Example 2. Suppose Bin 2 has \(a_{22}=100\) and \(a_{32}=20\), with two species 2 and 3, with no other contaminants and species 2 and 3 are fully contained in the bin. Now, the purity of the bin is 83.33% and completeness is 83.33%, hence, F1-score is 83.33%. Recall for species 2 and 3 will be 100% since it is not broken into multiple bins. However, the precision for Bin 2 will be 83.33%, hence a F1 score of 90.91%.

This means, AMBER penalize whenever a species is broken into pieces across bins while not significantly penalizing bin mergers between large bins and smaller bins. This is because, dominant species in the bin will determine the purity and completeness.

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wickramarachchi, A., Lin, Y. (2022). Metagenomics Binning of Long Reads Using Read-Overlap Graphs. In: Jin, L., Durand, D. (eds) Comparative Genomics. RECOMB-CG 2022. Lecture Notes in Computer Science(), vol 13234. Springer, Cham. https://doi.org/10.1007/978-3-031-06220-9_15

Download citation

DOI: https://doi.org/10.1007/978-3-031-06220-9_15
Published: 15 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06219-3
Online ISBN: 978-3-031-06220-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Metagenomics Binning of Long Reads Using Read-Overlap Graphs

Abstract

Access this chapter

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendices

A Dataset Information

B Interpretation of AMBER Per-bin F1-Score

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation