Machine Learning Models for Automatic Gene Ontology Annotation of Biological Texts

  • Conference paper
Artificial Intelligence in Medicine (AIME 2023)

Abstract

Gene ontology (GO) is a major source of biological knowledge that describes the functions of genes and gene products using a comprehensive controlled vocabulary of terms organized in a hierarchical structure. Automatic annotation of biological texts with GO terms has gained the attention of the scientific community because it helps to quickly identify relevant documents, or parts of text, related to specific biological functions or processes. In this paper, we propose and investigate a new GO-term annotation strategy that uses a non-parametric k-nearest neighbor model and relies on various vector-based representations of documents and of the GO terms linked to these documents. Our vector representations are based on machine learning and natural language processing (NLP) models, including singular value decomposition, Word2Vec, and topic-based scoring. We evaluate the performance of our model on a large benchmark corpus using a variety of standard and hierarchical evaluation metrics.
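
To make the general idea of nearest-neighbor GO annotation concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain bag-of-words count vectors with cosine similarity as a stand-in for the SVD, Word2Vec, and topic-based representations named in the abstract. The toy corpus, the function names (`bow`, `cosine`, `knn_annotate`), and the parameters `k` and `top_n` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def bow(text):
    """Lowercased bag-of-words counts (assumed stand-in for SVD/Word2Vec vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_annotate(query, corpus, k=2, top_n=2):
    """Rank GO terms by the summed similarity of the k nearest labeled documents."""
    q = bow(query)
    neighbors = sorted(
        ((cosine(q, bow(text)), terms) for text, terms in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, terms in neighbors:
        for term in terms:
            votes[term] += sim  # similarity-weighted vote for each linked GO term
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Toy, made-up training documents paired with GO term identifiers.
TOY_CORPUS = [
    ("caspase activation triggers apoptosis in damaged cells", ["GO:0006915"]),
    ("p53 induces apoptosis after dna damage", ["GO:0006915", "GO:0006281"]),
    ("ribosome assembles protein during translation of mrna", ["GO:0006412"]),
    ("repair of dna double strand breaks by homologous recombination", ["GO:0006281"]),
]

if __name__ == "__main__":
    print(knn_annotate("caspase dependent apoptosis in tumor cells", TOY_CORPUS))
```

In this sketch, a GO term linked to several close neighbors accumulates a higher score than a term linked to a single distant one. Replacing `bow` with a dense embedding would leave the voting step unchanged.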

Supported by the Defense Advanced Research Projects Agency (DARPA) through Cooperative Agreement D20AC00002 awarded by the U.S. Department of the Interior, Interior Business Center. The content of the article does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Notes

  1. https://www.ebi.ac.uk/GOA/index

  2. https://github.com/juijayati/GOA-AIME2023.git

  3. https://tinyurl.com/ypzttrur

  4. https://allenai.github.io/scispacy/

  5. https://www.nlm.nih.gov/research/umls/index.html

Author information

Corresponding author

Correspondence to Jayati H. Jui.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jui, J.H., Hauskrecht, M. (2023). Machine Learning Models for Automatic Gene Ontology Annotation of Biological Texts. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham. https://doi.org/10.1007/978-3-031-34344-5_24

  • DOI: https://doi.org/10.1007/978-3-031-34344-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34343-8

  • Online ISBN: 978-3-031-34344-5

  • eBook Packages: Computer Science, Computer Science (R0)
