A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2020)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12686)

Abstract

The functional profile of a metagenomic sample reveals the role that microbes play in their environment. Sequence alignment of short reads against curated databases has been widely used to profile metagenomic samples. However, this method is time-consuming and requires substantial computing resources. Although several alignment-free methods based on k-mer composition have been developed in recent years, they still require a large amount of memory. This paper proposes MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple neural network with one hidden layer to profile functional categories. Unlike other methods, MetaMLP enables partial matching by building its sequence embeddings over a reduced alphabet. MetaMLP identifies more reads than Diamond (one of the fastest sequence alignment methods) while maintaining high performance, with a precision of 0.99 and a recall of 0.99. MetaMLP can process 100 million reads in around 10 min on a laptop computer, a 50x speedup compared to Diamond. MetaMLP is freely available at https://bitbucket.org/gaarangoa/metamlp/src/master/.
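
The idea summarized in the abstract (reduced alphabet, k-mer embeddings, one hidden layer) can be illustrated with a minimal sketch. It is not MetaMLP's implementation: the amino acid grouping in `GROUPS`, the class `TinyEmbeddingClassifier`, the hashing scheme, and all sizes are hypothetical choices made for illustration, and the weights are random because training is omitted. The sketch only shows why a reduced alphabet allows partial matching: two peptides that differ by conservative substitutions collapse to the same k-mers and therefore to the same embedding.

```python
import math
import random

# Hypothetical reduced alphabet: each amino acid is replaced by a group
# representative. This grouping is an assumption, not the one used by MetaMLP.
GROUPS = {
    "A": "AVLIMC",  # hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
REDUCED = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map a protein sequence onto the reduced alphabet (unknowns become X)."""
    return "".join(REDUCED.get(aa, "X") for aa in seq.upper())


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_index(kmer, buckets):
    """Deterministic hash of a k-mer into a fixed number of buckets."""
    h = 0
    for ch in kmer:
        h = (h * 31 + ord(ch)) % buckets
    return h


class TinyEmbeddingClassifier:
    """Averaged k-mer embeddings followed by a linear softmax layer."""

    def __init__(self, n_classes, dim=8, buckets=1024, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        # Untrained (random) parameters; a real model would learn these.
        self.emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(buckets)]
        self.out = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(n_classes)]

    def embed(self, seq, k=3):
        """Hidden layer: the mean embedding of the reduced-alphabet k-mers."""
        tokens = kmers(reduce_sequence(seq), k)
        dim = len(self.emb[0])
        if not tokens:
            return [0.0] * dim
        vec = [0.0] * dim
        for token in tokens:
            row = self.emb[kmer_index(token, self.buckets)]
            for j in range(dim):
                vec[j] += row[j]
        return [v / len(tokens) for v in vec]

    def predict_proba(self, seq):
        """Softmax over class scores computed from the hidden vector."""
        hidden = self.embed(seq)
        scores = [sum(w * x for w, x in zip(row, hidden)) for row in self.out]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


if __name__ == "__main__":
    clf = TinyEmbeddingClassifier(n_classes=3)
    # MKVD and LRIE differ at every position but share one reduced form.
    print(reduce_sequence("MKVD"), reduce_sequence("LRIE"))
    print(clf.embed("MKVD") == clf.embed("LRIE"))
```

Hashing k-mers into a fixed number of buckets keeps memory bounded regardless of vocabulary size, which is one plausible way to address the memory cost the abstract attributes to k-mer methods; whether MetaMLP does this is not stated here.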

Acknowledgements

We gratefully acknowledge support from the USDA National Institute of Food and Agriculture competitive Grant 2017-68003-26498, the U.S. National Science Foundation Partnership in International Research and Education Award #1545756, and the U.S. National Science Foundation Award #2004751.

Author information

Corresponding author

Correspondence to L. Zhang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Arango-Argoty, G.A., Heath, L.S., Pruden, A., Vikesland, P.J., Zhang, L. (2021). A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_10

  • DOI: https://doi.org/10.1007/978-3-030-79290-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79289-3

  • Online ISBN: 978-3-030-79290-9

  • eBook Packages: Computer Science, Computer Science (R0)
