A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples

Arango-Argoty, G. A.; Heath, L. S.; Pruden, A.; Vikesland, P. J.; Zhang, L.

doi:10.1007/978-3-030-79290-9_10


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12686)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences


Abstract

The functional profile of a metagenomic sample reveals the roles that microbes play in their environment. Aligning short reads against curated databases is widely used to build such profiles, but it is time-consuming and computationally expensive. Several alignment-free methods based on k-mer composition have been developed in recent years, yet they still require a large amount of memory. This paper proposes MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple neural network with one hidden layer to profile functional categories. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet for sequence embeddings. MetaMLP identifies more reads than Diamond (one of the fastest sequence alignment methods) while maintaining a precision of 0.99 and a recall of 0.99. MetaMLP can process 100 million reads in around 10 minutes on a laptop computer, a 50x speedup over Diamond. MetaMLP is freely available at https://bitbucket.org/gaarangoa/metamlp/src/master/.
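The idea of partial matching through a reduced alphabet can be illustrated with a minimal sketch. It is not the MetaMLP implementation: the six-group amino acid alphabet, the k-mer length of 3, and the helper names (`reduce_sequence`, `kmers`, `shared_fraction`) are assumptions chosen for illustration, and the sketch assumes sequences are already expressed as amino acids. It shows how two sequences that differ only by similar residues share no exact k-mers, yet produce identical k-mer "words" once the alphabet is reduced.

```python
# Hypothetical grouping of the 20 amino acids into 6 classes by rough
# physicochemical similarity. This grouping is an assumption for
# illustration and is NOT taken from the MetaMLP paper.
GROUPS = {
    "A": "AGPST",  # small
    "C": "C",      # cysteine
    "D": "DENQ",   # acidic and amide
    "F": "FWY",    # aromatic
    "I": "ILMV",   # aliphatic hydrophobic
    "K": "HKR",    # basic
}
# Map every amino acid to the representative letter of its group.
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(protein):
    """Rewrite a protein sequence in the reduced alphabet ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in protein.upper())


def kmers(sequence, k=3):
    """Return all overlapping k-mers; these act as the 'words' to embed."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def shared_fraction(a, b, k=3, reduced=False):
    """Jaccard similarity between the k-mer sets of two sequences."""
    if reduced:
        a, b = reduce_sequence(a), reduce_sequence(b)
    set_a, set_b = set(kmers(a, k)), set(kmers(b, k))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Two sequences differing by substitutions within the assumed groups
# (K->R, V->I, T->S).
seq_a, seq_b = "MKVLAT", "MRILAS"
print(shared_fraction(seq_a, seq_b))                # exact k-mers: 0.0
print(shared_fraction(seq_a, seq_b, reduced=True))  # reduced alphabet: 1.0
```

In an embedding-based classifier, words that coincide after reduction would map to the same vector, so near-identical sequences end up with similar representations even when no exact k-mer matches.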




Acknowledgements

We acknowledge support from the USDA National Institute of Food and Agriculture competitive Grant 2017-68003-26498, the U.S. National Science Foundation Partnership in International Research and Education Award #1545756, and the U.S. National Science Foundation Award #2004751.

Author information

Authors and Affiliations

Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
G. A. Arango-Argoty, L. S. Heath & L. Zhang
Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, VA, USA
A. Pruden & P. J. Vikesland


Corresponding author

Correspondence to L. Zhang.

Editor information

Editors and Affiliations

The University of Texas at San Antonio, San Antonio, TX, USA
Sumit Kumar Jha
University of Connecticut, Storrs, CT, USA
Ion Măndoiu
University of Connecticut, Storrs Mansfield, CT, USA
Sanguthevar Rajasekaran
Department of Computer Science, Georgia State University, Roswell, GA, USA
Pavel Skums
Department of Computer Science, Georgia State University, Atlanta, GA, USA
Alex Zelikovsky


About this paper

Cite this paper

Arango-Argoty, G.A., Heath, L.S., Pruden, A., Vikesland, P.J., Zhang, L. (2021). A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_10


DOI: https://doi.org/10.1007/978-3-030-79290-9_10
Published: 03 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79289-3
Online ISBN: 978-3-030-79290-9
eBook Packages: Computer Science, Computer Science (R0)
