DOI: 10.1145/3534678.3539284
research-article

SIPF: Sampling Method for Inverse Protein Folding

Published: 14 August 2022

Abstract

Protein engineering has important applications in drug discovery. Inverse protein folding, a fundamental task in protein design, aims to generate a protein's amino acid sequence given a 3D graph structure. However, most existing methods for inverse protein folding are based on sequential generative models and are therefore limited in uncertainty quantification and in their ability to explore the entire protein space. To address these issues, we propose a sampling method for inverse protein folding (SIPF). Specifically, we formulate inverse protein folding as a sampling problem and design two pretrained neural networks as the Markov Chain Monte Carlo (MCMC) proposal distribution. To ensure sampling efficiency, we further design (i) an adaptive sampling scheme to select variables for sampling and (ii) an approximate target distribution as a surrogate for the unavailable target distribution. Empirical studies validate the effectiveness of SIPF, which achieves a 7.4% relative improvement in recovery rate and a 6.4% relative reduction in perplexity compared to the best baseline.
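
To make the sampling formulation concrete, the following is a minimal sketch of single-site Metropolis-Hastings over amino acid sequences. It is not the SIPF implementation: the abstract does not specify the pretrained proposal networks, the adaptive variable-selection scheme, or the surrogate target, so the sketch assumes a uniform single-site proposal and a toy target. The names `toy_log_target`, `metropolis_hastings`, and `recovery_rate` are hypothetical and introduced only for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_log_target(seq, native, weight=2.0):
    """Hypothetical stand-in for an approximate (surrogate) target.

    Returns an unnormalised log-density that rewards each position
    agreeing with a reference sequence. A real surrogate would be
    conditioned on the 3D structure instead.
    """
    return weight * sum(a == b for a, b in zip(seq, native))


def metropolis_hastings(log_target, init_seq, n_steps, rng):
    """Single-site Metropolis-Hastings over amino acid sequences.

    Assumption: the proposal picks one position uniformly and redraws
    its residue uniformly, so it is symmetric and the acceptance ratio
    reduces to the target ratio. A learned, non-symmetric proposal
    would also need the Hastings correction q(x | x') / q(x' | x).
    """
    current = list(init_seq)
    current_lp = log_target(current)
    for _ in range(n_steps):
        pos = rng.randrange(len(current))
        proposal = current.copy()
        proposal[pos] = rng.choice(AMINO_ACIDS)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, pi(x') / pi(x)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
    return "".join(current)


def recovery_rate(designed, native):
    """Fraction of positions where the designed residue matches the native one."""
    return sum(a == b for a, b in zip(designed, native)) / len(native)


if __name__ == "__main__":
    rng = random.Random(0)
    native = "MKTAYIAKQR"  # made-up reference sequence
    sample = metropolis_hastings(
        lambda s: toy_log_target(s, native), "A" * len(native), 2000, rng
    )
    print(sample, recovery_rate(sample, native))
```

Replacing the uniform redraw with a structure-conditioned proposal, and the uniform choice of `pos` with an adaptive rule, are the two places where a method such as SIPF would differ from this baseline sampler.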

Supplemental Material

MP4 File
sampling method for protein inverse folding.

    Published In

    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2022
    5033 pages
    ISBN:9781450393850
    DOI:10.1145/3534678

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. drug discovery
    2. inverse protein folding
    3. protein design
    4. protein engineering
    5. sampling method

    Qualifiers

    • Research-article

    Funding Sources

    • NSF

    Conference

    KDD '22

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
