Deep Learning Based NLP Embedding Approach for Biosequence Classification

Ganesan, Shamika; Kumar, S. Sachin; Soman, K. P.

doi:10.1007/978-3-031-21517-9_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13119)

Included in the following conference series:

International Conference on Mining Intelligence and Knowledge Exploration

Abstract

Biological sequence analysis involves the study of the structural characteristics and chemical composition of a sequence. From a computational perspective, the goal is to represent sequences as vectors that bring out the essential features of the virus and enable efficient classification. Methods such as one-hot encoding and Word2Vec models have been explored for embedding sequences into Euclidean space. However, these methods either fail to capture similarity information between k-mers or struggle to handle Out-of-Vocabulary (OOV) k-mers. To overcome these challenges, in this paper we explore the possibility of embedding biosequences of MERS, SARS and SARS-CoV-2 using the Global Vectors (GloVe) model and the FastText n-gram representation. We conduct an extensive study to evaluate their performance using classical Machine Learning algorithms and Deep Learning methods. We compare our results with dna2vec, an existing Word2Vec approach. Experimental results show that FastText n-gram based sequence embeddings enable deeper insights into the composition of each virus and thus give a classification accuracy close to 1. We also provide a study of the patterns in the viruses and support our results using various visualization techniques.
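
The OOV problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes a toy sequence, a k-mer length of 4, and character n-gram lengths of 2 to 3; none of these values come from the paper, and the helper names `kmers`, `char_ngrams` and `ngram_overlap` are hypothetical. The sketch only shows why a subword (FastText-style) representation can relate an unseen k-mer to known ones, whereas a whole-token vocabulary lookup cannot. It is not the authors' pipeline and does not train any embedding.

```python
def kmers(sequence, k=4, stride=1):
    """Split a nucleotide sequence into overlapping k-mers (the 'words')."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def char_ngrams(token, n_min=2, n_max=3):
    """Character n-grams of a token wrapped in '<' and '>' boundary markers,
    in the style FastText uses for subword units (lengths are illustrative)."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two tokens."""
    ga, gb = set(char_ngrams(a)), set(char_ngrams(b))
    return len(ga & gb) / len(ga | gb)


# Toy sequence for illustration only; not real viral data.
sequence = "ATGGCGTACG"
tokens = kmers(sequence, k=4)
vocabulary = set(tokens)

# A k-mer that never occurs in the toy sequence.
unseen = "ATGC"

# A whole-token lookup (one-hot or plain Word2Vec vocabulary) has no entry for it.
print(unseen in vocabulary)

# Its subword n-grams still overlap with those of known k-mers, so a
# subword-based model has something to build a vector from.
for known in sorted(vocabulary):
    print(known, round(ngram_overlap(unseen, known), 3))
```

In a trained subword model, each n-gram would carry its own learned vector and the k-mer vector would be built from them; the overlap score here is only a stand-in to make the shared structure visible.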

Author information

Authors and Affiliations

Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, Tamil Nadu, India
Shamika Ganesan, S. Sachin Kumar & K. P. Soman

Corresponding author

Correspondence to S. Sachin Kumar.

Editor information

Editors and Affiliations

Université de Pau et des Pays de l’Adour, Anglet, France
Richard Chbeir
Open University of Cyprus, Nicosia, Cyprus
Yannis Manolopoulos
Indian Institute of Information Technology Sri City, Chittoor, Andhra Pradesh, India
Rajendra Prasath

About this paper

Cite this paper

Ganesan, S., Kumar, S.S., Soman, K.P. (2022). Deep Learning Based NLP Embedding Approach for Biosequence Classification. In: Chbeir, R., Manolopoulos, Y., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2021. Lecture Notes in Computer Science, vol 13119. Springer, Cham. https://doi.org/10.1007/978-3-031-21517-9_16

DOI: https://doi.org/10.1007/978-3-031-21517-9_16
Published: 15 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21516-2
Online ISBN: 978-3-031-21517-9
eBook Packages: Computer Science, Computer Science (R0)
