DOI: 10.1145/3459930.3469527
research-article

Transfer learning for predicting virus-host protein interactions for novel virus sequences

Published: 01 August 2021

ABSTRACT

Viruses such as SARS-CoV-2 infect the human body by forming interactions between virus proteins and human proteins. However, experimental methods for finding protein interactions are inadequate: large-scale experiments are noisy, and small-scale experiments are slow and expensive. Inspired by the recent successes of deep neural networks, we hypothesize that deep learning methods are well positioned to aid and augment biological experiments and to help identify more accurate virus-host protein interaction maps. Moreover, computational methods can quickly adapt to predict how virus mutations change interactions with host proteins.

We propose DeepVHPPI, a novel deep learning framework that combines a self-attention-based transformer architecture with a transfer learning training strategy to predict interactions between human proteins and virus proteins that have novel sequence patterns. We show that our approach significantly outperforms state-of-the-art methods in predicting virus-human protein interactions for SARS-CoV-2, H1N1, and Ebola. In addition, we demonstrate how our framework can be used to predict and interpret the interactions of mutated SARS-CoV-2 Spike protein sequences.
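As a rough illustration of the self-attention operation that transformer architectures build on, the sketch below applies single-head scaled dot-product attention to a toy amino acid sequence. It is not the DeepVHPPI implementation: the random residue embeddings, the identity query/key/value projections, the function names `embed` and `self_attention`, and the example peptide are all assumptions made for illustration, whereas a trained model would learn the embeddings and projections.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence, dim=8, seed=0):
    """Map each residue to a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    table = {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    return [table[aa] for aa in sequence]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x is a list of n vectors of size d. Returns (outputs, weights), where
    weights[i][j] is how much position i attends to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    # Attention weights: softmax over scaled dot products of each query with all keys.
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        for q in x
    ]
    # Each output is the attention-weighted average of the value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


if __name__ == "__main__":
    tokens = embed("MFVFLVLLPL")  # a short toy peptide, chosen arbitrarily
    out, attn = self_attention(tokens)
    print(len(out), len(out[0]))   # number of residues, embedding size
    print(round(sum(attn[0]), 6))  # each attention row is a probability distribution
```

Each row of the returned weights is a distribution over sequence positions, which is the kind of quantity that attention-based models expose when interpreting which residues influence a prediction.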

Availability: All of our data and code are available on GitHub at https://github.com/QData/DeepVHPPI.

Published in

BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2021
603 pages
ISBN: 9781450384506
DOI: 10.1145/3459930

Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 254 of 885 submissions, 29%
