Abstract
Functional profiling of metagenomic samples helps reveal the roles that microbes play in their environment. Aligning short reads against curated databases is widely used to profile metagenomic samples, but this approach is time consuming and computationally demanding. Several alignment-free methods based on k-mer composition have been developed in recent years, yet they still require a large amount of memory. This paper proposes MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple neural network with one hidden layer to profile functional categories. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet for sequence embeddings. MetaMLP identifies a larger number of reads than Diamond (one of the fastest sequence alignment methods) while maintaining high performance, with a precision of 0.99 and a recall of 0.99. MetaMLP can process 100 million reads in around 10 min on a laptop computer, a 50x speedup compared to Diamond. MetaMLP is freely available at https://bitbucket.org/gaarangoa/metamlp/src/master/.
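To illustrate how a reduced alphabet can enable partial matching between k-mers, the following minimal sketch collapses amino acids into a few classes before extracting k-mers. The grouping in `GROUPS`, the value k = 3, the example sequences, and all function names are illustrative assumptions; they are not taken from MetaMLP and do not describe its actual alphabet or implementation.

```python
# Hypothetical grouping of the 20 amino acids into six classes.
# This mapping is an assumption for illustration only, not MetaMLP's alphabet.
GROUPS = {
    "A": "AVLIMC",  # hydrophobic / aliphatic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
# Map every residue to the representative letter of its class.
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_fraction(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = set(kmers(a, k)), set(kmers(b, k))
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Two made-up sequences that differ by conservative substitutions.
ref = "MKVLDEAG"
read = "MRILEDAG"

raw_similarity = shared_fraction(ref, read)
reduced_similarity = shared_fraction(reduce_sequence(ref), reduce_sequence(read))
print(raw_similarity, reduced_similarity)
```

In this toy case the two sequences share no exact 3-mers, but they become identical once written in the reduced alphabet, so every reduced k-mer matches. In a classifier of the kind the abstract describes, such reduced k-mers would serve as the tokens whose embeddings feed the hidden layer.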
Acknowledgements
We acknowledge support from the USDA National Institute of Food and Agriculture competitive Grant 2017-68003-26498, the U.S. National Science Foundation Partnership in International Research and Education Award #1545756, and the U.S. National Science Foundation Award #2004751.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Arango-Argoty, G.A., Heath, L.S., Pruden, A., Vikesland, P.J., Zhang, L. (2021). A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_10
DOI: https://doi.org/10.1007/978-3-030-79290-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79289-3
Online ISBN: 978-3-030-79290-9
eBook Packages: Computer Science, Computer Science (R0)