Protein function prediction for newly sequenced organisms

Abstract

Recent successes in protein function prediction have shown the superiority of approaches that integrate multiple types of experimental evidence over methods that rely solely on homology. However, newly sequenced organisms continue to pose a difficult challenge, because only their protein sequences are available and they lack data derived from large-scale experiments. Here we introduce S2F (Sequence to Function), a network propagation approach for the functional annotation of newly sequenced organisms. Our main idea is to systematically transfer functionally relevant data from model organisms to newly sequenced ones, thus allowing us to use a label propagation approach. S2F introduces a novel label diffusion algorithm that can account for the presence of overlapping communities of proteins with related functions. As most newly sequenced organisms are bacteria, we tested our approach in the context of bacterial genomes. Our extensive evaluation shows a substantial improvement over existing sequence-based methods, as well as over four state-of-the-art general-purpose protein function prediction methods. Our work demonstrates that employing a diffusion process over networks of transferred functional data is an effective way to improve predictions over simple homology. S2F is applicable to any type of newly sequenced organism as well as to those for which experimental evidence is available. A free, easy-to-run version of S2F is available at https://www.paccanarolab.org/s2f.
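To make the idea of label propagation concrete, the following is a minimal sketch of the classic 'local and global consistency' diffusion of Zhou et al. (ref. 24), not the S2F label diffusion algorithm itself, which is not described in this preview. The function name, the toy network, the value of alpha and the number of iterations are all assumptions chosen for illustration.

```python
import math


def propagate_labels(weights, seed_labels, alpha=0.8, iterations=100):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y for a single label.

    weights     -- symmetric, non-negative adjacency matrix (list of lists)
    seed_labels -- initial score per protein (1.0 = annotated, 0.0 = unknown)
    alpha       -- share of each score taken from neighbours (illustrative value)
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Symmetric normalisation: S[i][j] = W[i][j] / sqrt(d_i * d_j).
    # Isolated proteins (degree 0) get an all-zero row.
    norm = [
        [
            weights[i][j] / math.sqrt(degree[i] * degree[j])
            if degree[i] > 0 and degree[j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    scores = list(seed_labels)
    for _ in range(iterations):
        scores = [
            alpha * sum(norm[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * seed_labels[i]
            for i in range(n)
        ]
    return scores


# Toy network: proteins 0-1-2 form a chain, protein 3 is isolated.
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
toy_seeds = [1.0, 0.0, 0.0, 0.0]  # only protein 0 carries the label
toy_scores = propagate_labels(toy_weights, toy_seeds)
```

In this toy example the score decays with network distance from the annotated protein, and the isolated protein receives no score. In a setting such as the one described above, the edge weights would come from functional data transferred from other organisms rather than from a hand-written matrix.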

Fig. 1: Overview of the S2F approach.
Fig. 2: Smin metric for every organism per gene and per term, with lower values being better.
Fig. 3: Fmax for every organism per gene and per term, with higher values being better.
Fig. 4: AUC-ROC for every organism per gene and per term, with higher values being better.
Fig. 5: AUC-PR for every organism per gene and per term, with higher values being better.
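For readers unfamiliar with the Fmax metric named in Fig. 3, the sketch below follows the standard protein-centric definition used in the CAFA evaluations (refs. 3 and 6): the maximum, over score thresholds, of the harmonic mean of average precision and average recall. It is an illustration under stated assumptions, not the evaluation code used in the paper; the function name, the threshold grid and the placeholder GO identifiers are hypothetical.

```python
def f_max(true_terms, predicted_scores, thresholds=None):
    """Protein-centric Fmax.

    true_terms       -- dict: protein -> set of true GO terms
    predicted_scores -- dict: protein -> dict of GO term -> score in [0, 1]
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, truth in true_terms.items():
            scores = predicted_scores.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all benchmark proteins.
            recalls.append(hits / len(truth) if truth else 0.0)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Placeholder identifiers, not real GO terms.
toy_truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
toy_predictions = {
    "P1": {"GO:A": 0.9, "GO:C": 0.4},
    "P2": {"GO:C": 0.8},
}
toy_fmax = f_max(toy_truth, toy_predictions)
```

Higher values are better, as the caption states: a perfect predictor would reach 1.0 at some threshold.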

Data availability

The input sequence files (ref. 25) in FASTA format for all the organisms used in this paper are available at https://doi.org/10.5281/zenodo.5514323. The same URL also contains the detailed list of all organisms excluded when testing each specific bacterium.

Code availability

The code for S2F is freely available and maintained at https://www.paccanarolab.org/s2f. The exact version (ref. 26) used for this publication is available at https://doi.org/10.5281/zenodo.5513071.

References

  1. Cruz, L. M., Trefflich, S., Weiss, V. A. & Castro, M. A. A. Protein function prediction. Methods Mol. Biol. 1654, 55–75 (2017).

  2. Shehu, A., Barbará, D. & Molloy, K. in Big Data Analytics in Genomics (ed. Wong, K.-C.) 225–298 (Springer, 2016); https://doi.org/10.1007/978-3-319-41279-5_7

  3. Jiang, Y. et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17, 184 (2016).

  4. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  5. Cowen, L., Ideker, T., Raphael, B. J. & Sharan, R. Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562 (2017).

  6. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).

  7. Valentini, G. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 832–847 (2011).

  8. Friedberg, I. & Radivojac, P. in The Gene Ontology Handbook (eds Dessimoz, C. & Škunca, N.) 133–146 (Springer, 2017); https://doi.org/10.1007/978-1-4939-3743-1_10

  9. Obozinski, G., Lanckriet, G., Grant, C., Jordan, M. I. & Noble, W. S. Consistent probabilistic outputs for protein function prediction. Genome Biol. 9, S6 (2008).

  10. Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 47, D351–D360 (2019).

  11. Walhout, A. J. et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116–122 (2000).

  12. Yu, H. et al. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 14, 1107–1118 (2004).

  13. Ben-Hur, A. & Noble, W. S. Kernel methods for predicting protein-protein interactions. Bioinformatics 21, i38–i46 (2005).

  14. Sharan, R. et al. Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA 102, 1974–1979 (2005).

  15. Szklarczyk, D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).

  16. Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4 (2008).

  17. Huntley, R. P. et al. The GOA database: gene ontology annotation updates for 2015. Nucleic Acids Res. 43, D1057–D1063 (2015).

  18. Lavezzo, E., Falda, M., Fontana, P., Bianco, L. & Toppo, S. Enhancing protein function prediction with taxonomic constraints—the Argot2.5 web server. Methods 93, 15–23 (2016).

  19. Kulmanov, M. & Hoehndorf, R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36, 422–429 (2020).

  20. You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).

  21. You, R. et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 47, W379–W387 (2019).

  22. Makrodimitris, S., van Ham, R. C. H. J. & Reinders, M. J. T. Automatic gene function prediction in the 2020s. Genes 11, 1264 (2020).

  23. Cao, M. et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS ONE 8, e76339 (2013).

  24. Zhou, D., Bousquet, O., Lal, T. N., Weston, J. & Schölkopf, B. Learning with local and global consistency. In Proc. 16th International Conference on Neural Information Processing Systems (eds Thrun, S. et al.) 321–328 (MIT, 2004).

  25. Torres, M., Yang, H., Romero, A. E. & Paccanaro, A. Input data for 'Protein function prediction for newly sequenced organisms'. Zenodo https://doi.org/10.5281/ZENODO.5514323 (2021).

  26. Torres, M., Yang, H., Romero, A. E. & Paccanaro, A. Source code for 'Protein function prediction for newly sequenced organisms'. Zenodo https://doi.org/10.5281/ZENODO.5513071 (2021).

  27. UniProt Consortium UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

Acknowledgements

The first idea for this project was conceived in discussions with T. Gianoulis, whom we remember dearly for her intelligence, kindness, enthusiasm and passion for research. We also thank P. Bhat, T. Nepusz, J. Caceres, M. Frasca, G. Valentini, A. Devoto, L. Bögre, R. Sasidharan and M. Gerstein for many important and stimulating discussions. A.P. was supported by Biotechnology and Biological Sciences Research Council (https://bbsrc.ukri.org/) grant numbers BB/K004131/1, BB/F00964X/1 and BB/M025047/1, Medical Research Council (https://mrc.ukri.org) grant number MR/T001070/1, Consejo Nacional de Ciencia y Tecnología Paraguay (https://www.conacyt.gov.py/) grant numbers 14-INV-088 and PINV15–315, National Science Foundation Advances in Bio Informatics (https://www.nsf.gov/) grant number 1660648, Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro grant number E-26/201.079/2021 (260380) and Fundação Getulio Vargas.

Author information

Contributions

A.P. conceived the study. A.P. and H.Y. devised the algorithms, developed the prototype and performed preliminary evaluations. M.T. and A.E.R. implemented and extended the algorithms and evaluation metrics, performed large-scale experiments and analysed the results. A.P., M.T. and A.E.R. wrote the manuscript and evaluated the biological relevance of the results. All authors discussed the results and implications. A.P. supervised the project.

Corresponding author

Correspondence to Alberto Paccanaro.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Jiecong Lin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–58, Notes 1–18 and Table 1.

Reporting Summary

About this article

Cite this article

Torres, M., Yang, H., Romero, A.E. et al. Protein function prediction for newly sequenced organisms. Nat Mach Intell 3, 1050–1060 (2021). https://doi.org/10.1038/s42256-021-00419-7

  • DOI: https://doi.org/10.1038/s42256-021-00419-7
