Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction

Focus article, Soft Computing

Abstract

Precise annotation of protein functions is one of the most important requirements for a deep interpretation of molecular biology. The overwhelming majority of proteins, across species, lack sufficient supplementary information and therefore remain uncharacterized. In contrast, every known protein has one key piece of information available: its amino acid sequence. To make algorithms applicable to proteins of different species, researchers are therefore motivated to develop computational techniques that characterize proteins from their amino acid sequence alone. However, computational techniques such as deep learning algorithms require a huge amount of labeled data to produce good results, and because labeling is time and resource consuming, labeled data are scarce. To address both issues, the large number of uncharacterized proteins and the dependence of traditional deep learning algorithms on labeled data, we propose a model called GOGAN that operates on the amino acid sequence of a protein to predict its functions. GOGAN does not require any handcrafted features; it automatically extracts all the required information from the input sequence. It learns these features from massive unlabeled protein datasets, where “unlabeled data” refers to data that have not been assigned labels identifying their characteristics or properties. The features extracted by GOGAN can also be used in other applications, such as gene variation analysis, gene expression analysis and gene regulation network detection. The proposed model is benchmarked on the Homo sapiens protein dataset extracted from the UniProt database. Experimental results show clear improvements in several evaluation metrics over other methods. Overall, GOGAN achieves an F1 score of 72.1% with a Hamming loss of 9.5%, using only the amino acid sequences of proteins.
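For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how an F1 score and a Hamming loss can be computed for multi-label predictions, where each protein may carry several function labels at once. The toy label matrices and the helper names `hamming_loss` and `micro_f1` are illustrative assumptions, and micro-averaging is assumed here because the abstract does not state which F1 averaging was used; this is not the authors' evaluation code.

```python
from typing import List

Labels = List[List[int]]  # one row per protein, one 0/1 column per function label


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """F1 computed from true/false positives pooled over all labels (assumed averaging)."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 proteins, 4 candidate function labels each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]

print(f"Hamming loss: {hamming_loss(y_true, y_pred):.3f}")
print(f"Micro F1:     {micro_f1(y_true, y_pred):.3f}")
```

A lower Hamming loss and a higher F1 score are both better; the two can diverge when labels are sparse, which is why multi-label studies commonly report both.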

Availability of data and materials

The dataset can be obtained from UniProt (Consortium 2015). We have also provided the dataset at: https://github.com/musadaqmansoor/gogan.

Code Availability Statement

The code for this research project is available as open source at: https://github.com/musadaqmansoor/gogan.

References

  • (1999) Interpro. https://www.ebi.ac.uk/interpro/. Accessed on 01 July 2020

  • Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198


  • Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410


  • Ange Tato RN (2018) Improving adam optimizer. bioRxiv p 262501

  • Apostolopoulos ID, Mpesiana TA (2020) Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med, p 1

  • Arjovsky M, Chintala S, Bottou L (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875

  • Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29


  • Babbar R, Schölkopf B (2019) Data scarcity, robustness and extreme multi-label classification. Mach Learn 108(8–9):1329–1351


  • Bartel PL, Roecklein JA, SenGupta D et al (1996) A protein linkage map of escherichia coli bacteriophage t7. Nat Genet 12(1):72


  • Benso A, Di Carlo S, ur Rehman H, et al (2013) A combined approach for genome wide protein function annotation/prediction. Proteome Sci 11(1):S1

  • Borhani M (2020) Multi-label log-loss function using l-bfgs for document categorization. Eng Appl Artif Intell 91:103623

  • Bork P, Dandekar T, Diaz-Lazcoz Y et al (1998) Predicting function: from genes to genomes and back. J Mol Biol 283(4):707–725


  • Causier B (2004) Studying the interactome with the yeast two-hybrid system and mass spectrometry. Mass Spectrom Rev 23(5):350–367


  • Che J, Chen L, Guo ZH et al (2020) Drug target group prediction with multiple drug networks. Combin Chem High Throughput Screen 23(4):274–284


  • Chen Y, Qin X, Wang J, et al (2020) Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intell Syst

  • Consortium U (2015) Uniprot: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212


  • Cooper GM (2000) The cell: a molecular approach, 2nd edn. ASM Press, Washington


  • Cruz LM, Trefflich S, Weiss VA, et al (2017) Protein function prediction. Funct Genomics, pp 55–75

  • Deng M, Zhang K, Mehta S et al (2003) Prediction of protein function using protein-protein interaction data. J Comput Biol 10(6):947–960


  • Di Tullio A, Reale S, De Angelis F (2005) Molecular recognition by mass spectrometry. J Mass Spectrom 40(7):845–865


  • Finley RL, Brent R (1994) Interaction mating reveals binary and ternary connections between drosophila cell cycle regulators. Proc Natl Acad Sci 91(26):12980–12984

  • Friedberg I (2006) Automated protein function prediction-the genomic challenge. Brief Bioinform 7(3):225–242


  • Gaudet P, Livstone MS, Lewis SE et al (2011) Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief Bioinform 12(5):449–462


  • Gene OC, et al (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–56

  • Ghahramani A, Watt FM, Luscombe NM (2018) Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv p 262501

  • Ghavidel A, Cagney G, Emili A (2005) A skeleton of the human protein interactome. Cell 122(6):830–832


  • Giot L, Bader JS, Brouwer C, et al (2003) A protein interaction map of drosophila melanogaster. Science 302(5651):1727–1736

  • Gligorijević V, Barot M, Bonneau R (2018) deepnf: deep network fusion for protein function prediction. Bioinformatics 34(22):3873–3881


  • Goodfellow I, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680

  • Gulrajani I, Ahmed F, Arjovsky M, et al (2017) Improved training of wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777

  • Gunnar H (2018) Real-valued medical time series generation with recurrent conditional gans. bioRxiv p 262501

  • Gupta A, Zou J (2018) Feedback gan (fbgan) for dna: a novel feedback-loop architecture for optimizing protein functions. arXiv preprint arXiv:1804.01694

  • Huttenhower C, Hibbs M, Myers C et al (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22(23):2890–2897


  • Jiang Y, Oron TR, Clark WT et al (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 17(1):184


  • Joo W, Kim D, Shin S, et al (2020) Generalized gumbel-softmax gradient estimator for various discrete random variables. arXiv preprint arXiv:2003.01847

  • Kanehisa M (2020) Kanehisa Laboratories - Growth of Major Databases. Pathway Solutions; Bioinformatics Center. https://www.kanehisa.jp/en/db_growth.html. Accessed 01 July 2020

  • Killoran N, Lee LJ, Delong A, et al (2017) Generating and designing dna with deep generative models. arXiv preprint arXiv:1712.06148

  • Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Letovsky S, Kasif S (2003) Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19(suppl_1):i197–i204

  • Li S, Armstrong CM, Bertin N, et al (2004) A map of the interactome network of the metazoan c. elegans. Science 303(5657):540–543

  • Liang G, Zheng L (2020) A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput Methods Programs Biomed 187:104964

  • Liao W, Wang Y, Yin Y et al (2020) Improved sequence generation model for multi-label classification via cnn and initialized fully connection. Neurocomputing 382:188–195


  • Liu X (2017) Deep recurrent neural network for protein function prediction from sequence. arXiv preprint arXiv:1701.08318

  • Lv Z, Ao C, Zou Q (2019) Protein function prediction: from traditional classifier to deep learning. Proteomics, p 1900119

  • Marcotte EM, Pellegrini M, Ng HL et al (1999) Detecting protein function and protein–protein interactions from genome sequences. Science 285(5428):751–753


  • Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, Sydney, Australia

  • Nabieva E, Jim K, Agarwal A, et al (2005) Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21(suppl_1):i302–i310

  • Najafabadi MM, Villanustre F, Khoshgoftaar TM et al (2015) Deep learning applications and challenges in big data analytics. J Big Data 2(1):1


  • Nauman M, Rehman HU, Politano G et al (2019) Beyond homology transfer: deep learning for automated annotation of proteins. J Grid Comput 17(2):225–237


  • Ouyang W, Aristov A, Lelek M et al (2018) Deep learning massively accelerates super-resolution localization microscopy. Nat Biotechnol 36(5):460


  • Pal D, Eisenberg D (2005) Inference of protein function from protein structure. Structure 13(1):121–130


  • Pazos F, Sternberg MJ (2004) Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci 101(41):14754–14759

  • Pellegrini M, Marcotte EM, Thompson MJ et al (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci 96(8):4285–4288


  • Piovesan D, Giollo M, Leonardi E et al (2015) Inga: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res 43(W1):W134–W140


  • Radivojac P, Clark WT, Oron TR et al (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227


  • Rual JF, Venkatesan K, Hao T et al (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437(7062):1173


  • Shen LX, Basilion JP, Stanton VP (1999) Single-nucleotide polymorphisms can cause different structural folds of mrna. Proc Natl Acad Sci 96(14):7871–7876


  • Shoemaker BA, Panchenko AR (2007) Deciphering protein–protein interactions. Part I. experimental techniques and databases. PLoS Comput Biol 3(3):e42

  • Tieleman T, Hinton G (2012) Divide the gradient by a running average of its recent magnitude. Coursera neural netw. Mach Learn 6:26–31


  • Vazquez A, Flammini A, Maritan A et al (2003) Global protein function prediction from protein–protein interaction networks. Nat Biotechnol 21(6):697


  • Villani C (2008) Optimal transport: old and new, vol 338. Springer, Berlin


  • Vincent P, Larochelle H, Lajoie I, et al (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(12)

  • Walhout AJ, Sordella R, Lu X, et al (2000) Protein interaction mapping in c. elegans using proteins involved in vulval development. Science 287(5450):116–122

  • Watson JD, Laskowski RA, Thornton JM (2005) Predicting protein function from sequence and structural data. Curr Opin Struct Biol 15(3):275–284


  • Xin F, Radivojac P (2011) Computational methods for identification of functional residues in protein structures. Curr Protein Pept Sci 12(6):456–469


  • Zeiler MD (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701

  • Zhang F, Song H, Zeng M, et al (2019) Deepfunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics, p 1900019

  • Zhang ML, Fang JP (2020) Partial multi-label learning via credible label elicitation. IEEE Trans Pattern Anal Mach Intell

  • Zhuang F, Qi Z, Duan K, et al (2019) A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685


Funding

The authors did not receive any funding for this research project.

Author information

Contributions

The idea of using GANs for bioinformatics was conceived by Musadaq Mansoor and Muhammad Nauman. Hafeez Ur Rehman and Alfredo Benso provided domain knowledge. Musadaq Mansoor and Muhammad Nauman wrote the code, ran the experiments and analyzed the results. Alfredo Benso and Hafeez Ur Rehman helped analyze the results and contributed the final discussion. The manuscript was written by Musadaq Mansoor and was reviewed and updated by all authors.

Corresponding author

Correspondence to Musadaq Mansoor.

Ethics declarations

Conflict of interest

There are no competing financial interests connected with this research work.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors consent to the publication of this manuscript.

Additional information

Communicated by Irfan Uddin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Mansoor, M., Nauman, M., Ur Rehman, H. et al. Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction. Soft Comput 26, 7653–7667 (2022). https://doi.org/10.1007/s00500-021-06707-z

  • DOI: https://doi.org/10.1007/s00500-021-06707-z
