
Boosting Protein Language Models with Negative Sample Mining

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14950)


Abstract

We introduce a pioneering methodology for boosting large language models in the domain of protein representation learning. Our primary contribution lies in a refinement process that corrects the over-reliance on co-evolution knowledge: networks are trained to distill valuable insights from negative samples, namely protein pairs drawn from disparate categories. Capitalizing on this approach, our technique steers the training of transformer-based models within the attention score space. This strategy not only amplifies performance but also reflects the nuanced biological behaviors exhibited by proteins, offering evidence aligned with traditional biological mechanisms such as protein-protein interaction. We experimentally observe improved performance across various tasks and datasets when the method is applied on top of several well-established large protein models. This paradigm opens up promising horizons for further progress in protein research and computational biology. The code is open-sourced at https://github.com/LOGO-CUHKSZ/NM-Transformer.

Y. Xu and X. Zhao contributed equally to this paper.
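To make the abstract's description more concrete, the following is a minimal, hypothetical sketch of how a negative-sample objective in the attention score space might look, assuming a margin-based contrastive loss with cosine similarity between attention maps. The function names, pooling choices, and loss form are illustrative assumptions rather than the paper's exact implementation; see the open-sourced repository for the actual method.

```python
# Hypothetical sketch (not the authors' exact implementation): a margin-based
# loss over attention-score maps. Negative samples are attention maps from
# proteins of disparate categories; the loss pushes the anchor's attention
# pattern away from them relative to a same-category (positive) protein.
import torch
import torch.nn.functional as F


def attention_map_similarity(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two attention-score maps of shape (heads, L, L),
    assumed here to be pooled or cropped to a common sequence length L."""
    return F.cosine_similarity(attn_a.flatten(1), attn_b.flatten(1), dim=-1).mean()


def negative_mining_loss(anchor_attn: torch.Tensor,
                         positive_attn: torch.Tensor,
                         negative_attns: list,
                         margin: float = 0.5) -> torch.Tensor:
    """Hinge-style loss in attention-score space: a negative pair contributes
    only when its similarity to the anchor violates the margin against the
    positive pair's similarity."""
    pos_sim = attention_map_similarity(anchor_attn, positive_attn)
    loss = torch.zeros((), device=anchor_attn.device)
    for neg_attn in negative_attns:
        neg_sim = attention_map_similarity(anchor_attn, neg_attn)
        loss = loss + F.relu(neg_sim - pos_sim + margin)
    return loss / max(len(negative_attns), 1)


if __name__ == "__main__":
    # Random tensors stand in for attention scores extracted from a protein LM.
    heads, length = 4, 64
    anchor = torch.rand(heads, length, length)
    positive = torch.rand(heads, length, length)
    negatives = [torch.rand(heads, length, length) for _ in range(3)]
    print(negative_mining_loss(anchor, positive, negatives))
```

In practice such an auxiliary term would be added to the model's standard pre-training or fine-tuning objective, so that the regularization in attention space complements rather than replaces the usual sequence-level loss.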


Author information

Corresponding author

Correspondence to Tianshu Yu.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, Y., Zhao, X., Song, X., Wang, B., Yu, T. (2024). Boosting Protein Language Models with Negative Sample Mining. In: Bifet, A., Krilavičius, T., Miliou, I., Nowaczyk, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14950. Springer, Cham. https://doi.org/10.1007/978-3-031-70381-2_13


  • DOI: https://doi.org/10.1007/978-3-031-70381-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70380-5

  • Online ISBN: 978-3-031-70381-2

  • eBook Packages: Computer Science, Computer Science (R0)
