CONSULT-II: Taxonomic Identification Using Locality Sensitive Hashing

  • Conference paper
Comparative Genomics (RECOMB-CG 2023)

Abstract

Metagenomics is widely used to study the microbiome using environmental samples, and taxonomic classification of reads is a precursor to many analyses of such data. Taxonomic classification requires comparing sample reads against a reference dataset of known organisms. Crucially, the genomes represented in a sample may be phylogenetically distant from their closest match in the reference set. Thus, simply mapping reads to genomes is insufficient; we need to find inexact matches to species with substantial distance. While k-mer-based methods, such as Kraken, have proved popular, they have limited ability to match against distant taxa. In this paper, we use locality sensitive hashing to design a k-mer-based method that can match reads to genomes with higher distance than existing methods. We build on an earlier contamination detection method, CONSULT, to add taxonomic classification abilities. We show in a series of experiments that our method, CONSULT-II, has higher recall than alternatives when precision is about the same. Its results can also be summarized to obtain a taxonomic profile, which we show outperforms leading methods with respect to some measurement criteria. CONSULT-II is available at https://github.com/bo1929/CONSULT-II.
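The idea of matching k-mers inexactly with locality sensitive hashing can be illustrated with a minimal sketch. It assumes a bit-sampling scheme: each hash table keys a k-mer by a subset of its positions, so two k-mers that differ only outside that subset collide and can then be checked against a Hamming distance threshold. The function names (`build_index`, `query`), the parameter values (k = 8, two masks of four positions, threshold p = 2), and the toy sequences are illustrative assumptions and are not taken from the CONSULT-II implementation.

```python
import random
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def make_masks(k, h, l, seed=0):
    """Draw l random masks, each a sorted list of h distinct positions in [0, k)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(k), h)) for _ in range(l)]


def lsh_key(kmer, mask):
    """Bit-sampling hash: keep only the characters at the masked positions."""
    return "".join(kmer[i] for i in mask)


def build_index(reference, k, masks):
    """Build one hash table per mask, mapping LSH key -> set of reference k-mers."""
    tables = [defaultdict(set) for _ in masks]
    for kmer in kmers(reference, k):
        for table, mask in zip(tables, masks):
            table[lsh_key(kmer, mask)].add(kmer)
    return tables


def query(kmer, tables, masks, p):
    """Return (closest colliding reference k-mer, distance) if distance <= p, else None."""
    best = None
    for table, mask in zip(tables, masks):
        for cand in table.get(lsh_key(kmer, mask), ()):
            d = hamming(kmer, cand)
            if d <= p and (best is None or d < best[1]):
                best = (cand, d)
    return best


# Toy example with fixed masks (hypothetical values, chosen for reproducibility).
masks = [[0, 1, 2, 3], [4, 5, 6, 7]]
tables = build_index("ACGTACGGTTCA", 8, masks)
print(query("ACGTACAG", tables, masks, p=2))  # ('ACGTACGG', 1)
```

In this toy run the query k-mer differs from a reference k-mer at one position; the first mask avoids that position, so the two collide and the match is reported with distance 1. An exact-match lookup would have missed it, which is the limitation of exact k-mer matching that the abstract points to.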

Author information

Corresponding author

Correspondence to Siavash Mirarab.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Şapcı, A.O.B., Rachtman, E., Mirarab, S. (2023). CONSULT-II: Taxonomic Identification Using Locality Sensitive Hashing. In: Jahn, K., Vinař, T. (eds) Comparative Genomics. RECOMB-CG 2023. Lecture Notes in Computer Science, vol 13883. Springer, Cham. https://doi.org/10.1007/978-3-031-36911-7_13

  • DOI: https://doi.org/10.1007/978-3-031-36911-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36910-0

  • Online ISBN: 978-3-031-36911-7

  • eBook Packages: Computer Science, Computer Science (R0)
