Deep Learning Based NLP Embedding Approach for Biosequence Classification

  • Conference paper
Mining Intelligence and Knowledge Exploration (MIKE 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13119)

Abstract

Biological sequence analysis involves the study of the structural characteristics and chemical composition of a sequence. From a computational perspective, the goal is to represent sequences as vectors that bring out the essential features of each virus and enable efficient classification. Methods such as one-hot encoding and Word2Vec models have been explored for embedding sequences into Euclidean space, but they either fail to capture similarity information between k-mers or cannot handle Out-of-Vocabulary (OOV) k-mers. To overcome these challenges, in this paper we explore embedding biosequences of MERS, SARS and SARS-CoV-2 using the Global Vectors (GloVe) model and the FastText n-gram representation. We conduct an extensive study to evaluate their performance using classical Machine Learning algorithms and Deep Learning methods, and we compare our results with dna2vec, an existing Word2Vec approach. Experimental results show that FastText n-gram based sequence embeddings give deeper insight into the composition of each virus and yield a classification accuracy close to 1. We also study the patterns in the viruses and support our results using various visualization techniques.
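
The OOV issue mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `kmers` and `char_ngrams`, the toy sequence, and the choices k = 4 and n-gram lengths 2 to 3 are assumptions made only for illustration. It splits a sequence into overlapping k-mers and shows that a k-mer absent from the vocabulary still shares FastText-style character n-grams with known k-mers, which is what lets a subword model assign it a vector.

```python
def kmers(sequence, k=3, stride=1):
    """Split a nucleotide sequence into overlapping k-mers (the 'words')."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def char_ngrams(token, n_min=2, n_max=3):
    """Subword units in the FastText style: character n-grams of the
    token after wrapping it in the boundary markers '<' and '>'."""
    wrapped = f"<{token}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


sequence = "ATGCGTAC"          # toy sequence, not real viral data
tokens = kmers(sequence, k=4)  # k = 4 is an arbitrary illustrative choice
vocabulary = set(tokens)

oov = "ATGA"                   # a 4-mer that never occurs in the toy sequence

# Subword n-grams seen in any vocabulary k-mer.
known_ngrams = set().union(*(char_ngrams(t) for t in vocabulary))

# n-grams of the unseen k-mer that the vocabulary already covers.
shared = char_ngrams(oov) & known_ngrams

print(tokens)
print(oov in vocabulary, sorted(shared))
```

A word-level model such as Word2Vec has no entry for `ATGA`, whereas a subword model can compose a vector for it from the n-grams in `shared`. An actual FastText model additionally hashes the n-grams and learns a vector for each one; that training step is omitted here.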

Notes

  1. https://fasttext.cc/

  2. https://nlp.stanford.edu/projects/glove/

  3. https://scikit-learn.org/stable/

  4. https://colab.research.google.com/notebooks/intro.ipynb

  5. https://www.nltk.org/

References

  1. Koyama, T., Platt, D., Parida, L.: Variants of the SARS-CoV-2 genomes. Bull. World Health Organ. 98, 495–504 (2020)

  2. Malik, Y.A.: Properties of coronavirus and SARS-CoV-2. Malays. J. Pathol. 42(1), 3–11 (2020). PMID: 32342926

  3. Lan, T.C.T., et al.: Structure of the full SARS-CoV-2 RNA genome in infected cells

  4. Junior, J.A.C.N., Santos, A.M., Quintans-Júnior, L.J., Walker, C.I.B., Borges, L.P., Serafini, M.R.: SARS, MERS and SARS-CoV-2 (COVID-19) treatment: a patent review. Expert Opin. Ther. Pat. 30(8), 567–579 (2020)

  5. Li, Q., et al.: The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity. Cell 182(5), 1284–1294 (2020)

  6. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  7. NCBI Virus. https://www.ncbi.nlm.nih.gov/labs/virus/vssi

  8. Ng, P.: dna2vec: consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279 (2017)

  9. Lopez Rincon, A., et al.: Accurate identification of SARS-CoV-2 from viral genome sequences using deep learning. bioRxiv (2020)

  10. Zhang, J., Chen, Q., Liu, B.: DeepDRBP-2L: a new genome annotation predictor for identifying DNA binding proteins and RNA binding proteins using convolutional neural network and long short-term memory. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics

  11. Jha, P.K., Vijay, A., Halu, A., Uchida, S., Aikawa, M.: Gene expression profiling reveals the shared and distinct transcriptional signatures in human lung epithelial cells infected with SARS-CoV-2, MERS-CoV, or SARS-CoV: potential implications in cardiovascular complications of COVID-19. Front. Cardiovasc. Med. 7, 623012 (2021). Accessed 15 Jan 2021

  12. Wang, L., Zhou, J., Wang, Q., Wang, Y., Kang, C.: Rapid design and development of CRISPR-Cas13a targeting SARS-CoV-2 spike protein. Theranostics 11(2), 649–664 (2021). Accessed 1 Jan 2021

  13. Heo, L., Feig, M.: Modeling of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) proteins by machine learning and physics-based refinement (2020)

  14. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013), pp. 1–12 (2013)

  15. Kwan, H.K., Arniker, S.B.: Numerical representation of DNA sequences, pp. 307–310 (2009). https://doi.org/10.1109/EIT.2009.5189632

  16. Lopez-Rincon, A., et al.: Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning. Sci. Rep. 11(1), 1–11 (2021)

  17. Ballesio, F., et al.: Determining a novel feature-space for SARS-CoV-2 sequence data (2020)

  18. Asgari, E., Mofrad, M.R.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 10, e0141287 (2015)

  19. Kimothi, D., et al.: Distributed representations for biological sequence analysis. arXiv preprint arXiv:1608.05949 (2016)

  20. Le, N.Q.K., Yapp, E.K.Y., Nagasundaram, N., Yeh, H.Y.: Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous FastText N-grams. Front. Bioeng. Biotechnol. 7, 305 (2019)

Author information

Corresponding author

Correspondence to S. Sachin Kumar.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ganesan, S., Kumar, S.S., Soman, K.P. (2022). Deep Learning Based NLP Embedding Approach for Biosequence Classification. In: Chbeir, R., Manolopoulos, Y., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2021. Lecture Notes in Computer Science, vol 13119. Springer, Cham. https://doi.org/10.1007/978-3-031-21517-9_16

  • DOI: https://doi.org/10.1007/978-3-031-21517-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21516-2

  • Online ISBN: 978-3-031-21517-9

  • eBook Packages: Computer Science, Computer Science (R0)
