Abstract
Biological sequence analysis involves the study of the structural characteristics and chemical composition of a sequence. From a computational perspective, the goal is to represent sequences as vectors that bring out the essential features of the virus and enable efficient classification. Methods such as one-hot encoding and Word2Vec models have been explored for embedding sequences into Euclidean space, but these methods either fail to capture similarity information between k-mers or cannot handle Out-of-Vocabulary (OOV) k-mers. To overcome these challenges, in this paper we explore the possibility of embedding biosequences of MERS, SARS and SARS-CoV-2 using the Global Vectors (GloVe) model and the FastText n-gram representation. We conduct an extensive study to evaluate their performance using classical Machine Learning algorithms and Deep Learning methods, and we compare our results with dna2vec, an existing Word2Vec approach. Experimental results show that FastText n-gram based sequence embeddings give deeper insight into the composition of each virus and yield a classification accuracy close to 1. We also study the patterns in the viruses and support our results using various visualization techniques.
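To make the OOV point concrete, the following is a minimal sketch, not the authors' implementation, of how a sequence can be split into k-mers and how FastText-style character n-grams let an unseen k-mer share subword units with k-mers seen during training. The toy sequence, the function names `kmers` and `char_ngrams`, and the values of `k`, `n_min` and `n_max` are illustrative assumptions and are not taken from the paper.

```python
def kmers(sequence, k=6, stride=1):
    """Split a nucleotide sequence into overlapping k-mers (the "words")."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def char_ngrams(token, n_min=2, n_max=3):
    """Return FastText-style subword units for a token.

    The token is wrapped in the boundary markers '<' and '>' and every
    character n-gram with n_min <= n <= n_max is collected.
    """
    wrapped = f"<{token}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


# Made-up fragment for illustration only; not real viral data.
toy_sequence = "ATGGCGTACGTT"
tokens = kmers(toy_sequence, k=6)
vocabulary = set(tokens)

# A k-mer that never occurs in the toy sequence (out of vocabulary).
oov = "ATGGCC"

# Subword units observed across the whole vocabulary.
known_ngrams = set().union(*(char_ngrams(t) for t in vocabulary))

# Units the OOV k-mer shares with the vocabulary. A subword model can
# build a vector for the OOV k-mer from these shared units, whereas a
# pure lookup-table model has no entry for it.
shared = char_ngrams(oov) & known_ngrams

print(tokens)
print(oov in vocabulary)
print(sorted(shared))
```

A word-level model such as Word2Vec or GloVe stores one vector per vocabulary entry, so `oov` would have no representation there; the shared n-grams are what allow a subword-based model to fall back on partial information.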
References
Koyama, T., Platt, D., Parida, L.: Variants of the SARS-CoV-2 genomes. Bull. World Health Organ. 98, 495–504 (2020)
Malik, Y.A.: Properties of coronavirus and SARS-CoV-2. Malays. J. Pathol. 42(1), 3–11 (2020). PMID: 32342926
Lan, T.C.T., et al.: Structure of the full SARS-CoV-2 RNA genome in infected cells
Junior, J.A.C.N., Santos, A.M., Quintans-Júnior, L.J., Walker, C.I.B., Borges, L.P., Serafini, M.R.: SARS, MERS and SARS-CoV-2 (COVID-19) treatment: a patent review. Expert Opin. Ther. Pat. 30(8), 567–579 (2020)
Li, Q., et al.: The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity. Cell 182(5), 1284–1294 (2020)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
NCBI Virus. https://www.ncbi.nlm.nih.gov/labs/virus/vssi
Ng, P.: dna2vec: consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279 (2017)
Lopez Rincon, A., et al.: Accurate identification of SARS-CoV-2 from viral genome sequences using deep learning. bioRxiv (2020)
Zhang, J., Chen, Q., Liu, B.: DeepDRBP-2L: a new genome annotation predictor for identifying DNA binding proteins and RNA binding proteins using convolutional neural network and long short-term memory. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Jha, P.K., Vijay, A., Halu, A., Uchida, S., Aikawa, M.: Gene expression profiling reveals the shared and distinct transcriptional signatures in human lung epithelial cells infected with SARS-CoV-2, MERS-CoV, or SARS-CoV: potential implications in cardiovascular complications of COVID-19. Front Cardiovasc Med. 7, 623012 (2021). Accessed 15 Jan 2021
Wang, L., Zhou, J., Wang, Q., Wang, Y., Kang, C.: Rapid design and development of CRISPR-Cas13a targeting SARS-CoV-2 spike protein. Theranostics. 11(2), 649–664 (2021). Accessed 1 Jan 2021
Heo, L., Feig, M.: Modeling of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) proteins by machine learning and physics-based refinement (2020)
Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013), pp. 1–12 (2013)
Kwan, H.K., Arniker, S.B.: Numerical representation of DNA sequences, pp. 307–310 (2009). https://doi.org/10.1109/EIT.2009.5189632
Lopez-Rincon, A., et al.: Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning. Sci. Rep. 11(1), 1–11 (2021)
Ballesio, F., et al.: Determining a novel feature-space for SARS-CoV-2 sequence data (2020)
Asgari, E., Mofrad, M.R.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 10, e0141287 (2015)
Kimothi, D., et al.: Distributed representations for biological sequence analysis. arXiv preprint arXiv:1608.05949 (2016)
Le, N.Q.K., Yapp, E.K.Y., Nagasundaram, N., Yeh, H.Y.: Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous FastText N-grams. Front. Bioeng. Biotechnol. 7, 305 (2019)
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ganesan, S., Kumar, S.S., Soman, K.P. (2022). Deep Learning Based NLP Embedding Approach for Biosequence Classification. In: Chbeir, R., Manolopoulos, Y., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2021. Lecture Notes in Computer Science, vol 13119. Springer, Cham. https://doi.org/10.1007/978-3-031-21517-9_16
DOI: https://doi.org/10.1007/978-3-031-21517-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21516-2
Online ISBN: 978-3-031-21517-9
eBook Packages: Computer Science, Computer Science (R0)