Abstract
We introduce a methodology for boosting large language models in protein representation learning. Our primary contribution is a refinement process that corrects the over-reliance on co-evolution knowledge: networks are trained to distill insights from negative samples, namely protein pairs drawn from disparate categories. Building on this approach, our technique steers the training of transformer-based models within the attention-score space. This strategy not only improves performance but also reflects the nuanced biological behaviors of proteins, providing evidence aligned with traditional biological mechanisms such as protein-protein interaction. We experimentally observe improved performance on a variety of tasks and datasets, on top of several well-established large protein models. This paradigm opens promising avenues for further progress in protein research and computational biology. The code is open-sourced at https://github.com/LOGO-CUHKSZ/NM-Transformer.
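To make the idea concrete, below is a minimal PyTorch sketch of a margin-based loss that mines negatives in the attention-score space: pooled attention maps of a same-category protein pair are pushed to be more similar than those of a different-category pair. The pooling scheme, cosine similarity, and `margin` value are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch of negative-sample mining in the attention-score space.
# NOT the authors' implementation: the pooling, similarity measure, and
# hinge/margin loss below are assumptions made for exposition.
import torch
import torch.nn.functional as F

def pooled_attention(attn):
    """Average a stack of attention maps (layers x heads x L x L)
    into a single (L x L) map of pairwise residue scores."""
    return attn.mean(dim=(0, 1))

def attention_similarity(attn_a, attn_b):
    """Cosine similarity between flattened, pooled attention maps.
    Assumes both sequences were padded/cropped to the same length L."""
    a = pooled_attention(attn_a).flatten()
    b = pooled_attention(attn_b).flatten()
    return F.cosine_similarity(a, b, dim=0)

def negative_mining_loss(attn_anchor, attn_pos, attn_neg, margin=0.5):
    """Hinge loss: attention maps of same-category proteins (anchor, pos)
    should be more similar than different-category pairs (anchor, neg)
    by at least `margin`."""
    sim_pos = attention_similarity(attn_anchor, attn_pos)
    sim_neg = attention_similarity(attn_anchor, attn_neg)
    return F.relu(margin - sim_pos + sim_neg)

# Toy usage with random "attention" tensors (2 layers, 4 heads, length 16);
# in practice these would come from a pretrained protein language model.
if __name__ == "__main__":
    torch.manual_seed(0)
    anchor, pos, neg = (torch.rand(2, 4, 16, 16).softmax(dim=-1)
                        for _ in range(3))
    print(negative_mining_loss(anchor, pos, neg).item())
```

In this reading, the loss acts directly on attention scores rather than on output embeddings, which is one way to regularize what the model attends to; the actual pairing and training procedure are described in the paper.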
Y. Xu and X. Zhao contributed equally to this paper.