research-article

SIPF: Sampling Method for Inverse Protein Folding

Authors:

Jimeng Sun

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 378 - 388

https://doi.org/10.1145/3534678.3539284

Published: 14 August 2022

Abstract

Protein engineering has important applications in drug discovery. Inverse protein folding, a fundamental task in protein design, aims to generate a protein's amino acid sequence given its 3D graph structure. However, most existing methods for inverse protein folding are based on sequential generative models and are therefore limited in uncertainty quantification and in their ability to explore the entire protein space. To address these issues, we propose a sampling method for inverse protein folding (SIPF). Specifically, we formulate inverse protein folding as a sampling problem and design two pretrained neural networks that serve as the Markov Chain Monte Carlo (MCMC) proposal distribution. To ensure sampling efficiency, we further design (i) an adaptive sampling scheme to select variables for sampling and (ii) an approximate target distribution as a surrogate for the unavailable target distribution. Empirical studies validate the effectiveness of SIPF, which achieves a 7.4% relative improvement in recovery rate and a 6.4% relative reduction in perplexity compared to the best baseline.
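To make the sampling formulation concrete, the following is a minimal sketch of a Metropolis-Hastings sampler over amino acid sequences. It is not the SIPF implementation: the abstract describes pretrained neural networks as the proposal, an adaptive scheme for choosing positions, and an approximate target distribution, none of which are reproduced here. Instead, this sketch assumes a uniform single-site proposal and a toy target (`log_target`) that simply rewards agreement with a reference sequence. All function names are hypothetical, and the `recovery_rate` and `perplexity` helpers follow common definitions of those metrics, which may differ in detail from the ones used in the paper.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def log_target(seq, reference):
    """Toy unnormalized log-density (an assumption, not the paper's target).

    Rewards every position that matches a reference sequence.
    """
    return 2.0 * sum(a == b for a, b in zip(seq, reference))


def propose(seq, rng):
    """Symmetric single-site proposal: resample one position uniformly.

    SIPF reportedly uses pretrained networks here; this is a stand-in.
    """
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def metropolis_hastings(init, reference, n_steps, rng):
    """Run Metropolis-Hastings and return the chain of visited sequences.

    The proposal is symmetric, so the acceptance probability reduces to
    min(1, target(candidate) / target(current)).
    """
    current = init
    current_lp = log_target(current, reference)
    samples = []
    for _ in range(n_steps):
        candidate = propose(current, rng)
        candidate_lp = log_target(candidate, reference)
        accept_prob = math.exp(min(0.0, candidate_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = candidate, candidate_lp
        samples.append(current)
    return samples


def recovery_rate(predicted, native):
    """Fraction of positions where the predicted residue equals the native one."""
    return sum(a == b for a, b in zip(predicted, native)) / len(native)


def perplexity(native_probs):
    """exp of the mean negative log-probability assigned to the native residues."""
    mean_nll = -sum(math.log(p) for p in native_probs) / len(native_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    rng = random.Random(0)
    native = "MKTAYIAKQR"  # made-up sequence for illustration only
    start = "".join(rng.choice(AMINO_ACIDS) for _ in native)
    chain = metropolis_hastings(start, native, n_steps=2000, rng=rng)
    print("recovery of last sample:", recovery_rate(chain[-1], native))
```

Keeping the whole chain rather than a single best sequence is what gives a sampling method its handle on uncertainty: per-position residue frequencies across the chain can be read as an empirical estimate of the distribution over sequences for a structure.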



Index Terms

SIPF: Sampling Method for Inverse Protein Folding
1. Computing methodologies
  1. Artificial intelligence
    1. Search methodologies
      1. Continuous space search


Published In


KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN: 9781450393850

DOI: 10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

NSF

Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
