Abstract
We introduce a methodology for boosting large language models in protein representation learning. Our primary contribution is a refinement process that corrects the over-reliance on co-evolution knowledge: networks are trained to distill insights from negative samples, namely protein pairs drawn from disparate categories. Building on this approach, our technique steers the training of transformer-based models within the attention-score space. This strategy not only improves performance but also reflects the nuanced biological behaviors of proteins, providing evidence aligned with traditional biological mechanisms such as protein-protein interaction. We experimentally observe improved performance on a variety of tasks and datasets, on top of several well-established large protein models. This paradigm opens promising avenues for further progress in protein research and computational biology. The code is open-sourced at https://github.com/LOGO-CUHKSZ/NM-Transformer.
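To make the idea concrete, below is a minimal PyTorch sketch of a margin-based loss that mines negatives in the attention-score space: pooled attention maps of a same-category protein pair are pushed to be more similar than those of a different-category pair. The pooling scheme, cosine similarity, and `margin` value are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch of negative-sample mining in the attention-score space.
# NOT the authors' implementation: the pooling, similarity measure, and
# hinge/margin loss below are assumptions made for exposition.
import torch
import torch.nn.functional as F

def pooled_attention(attn):
    """Average a stack of attention maps (layers x heads x L x L)
    into a single (L x L) map of pairwise residue scores."""
    return attn.mean(dim=(0, 1))

def attention_similarity(attn_a, attn_b):
    """Cosine similarity between flattened, pooled attention maps.
    Assumes both sequences were padded/cropped to the same length L."""
    a = pooled_attention(attn_a).flatten()
    b = pooled_attention(attn_b).flatten()
    return F.cosine_similarity(a, b, dim=0)

def negative_mining_loss(attn_anchor, attn_pos, attn_neg, margin=0.5):
    """Hinge loss: attention maps of same-category proteins (anchor, pos)
    should be more similar than different-category pairs (anchor, neg)
    by at least `margin`."""
    sim_pos = attention_similarity(attn_anchor, attn_pos)
    sim_neg = attention_similarity(attn_anchor, attn_neg)
    return F.relu(margin - sim_pos + sim_neg)

# Toy usage with random "attention" tensors (2 layers, 4 heads, length 16);
# in practice these would come from a pretrained protein language model.
if __name__ == "__main__":
    torch.manual_seed(0)
    anchor, pos, neg = (torch.rand(2, 4, 16, 16).softmax(dim=-1)
                        for _ in range(3))
    print(negative_mining_loss(anchor, pos, neg).item())
```

In this reading, the loss acts directly on attention scores rather than on output embeddings, which is one way to regularize what the model attends to; the actual pairing and training procedure are described in the paper.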
Y. Xu and X. Zhao contributed equally to this paper.