Abstract
Generative models for structure-based molecular design hold considerable promise for drug discovery, with the potential to speed up the hit-to-lead development cycle while improving the quality of drug candidates and reducing costs. Data sparsity and bias are, however, the two main roadblocks to the development of three-dimensionally aware models. Here we propose a training protocol based on multilevel self-contrastive learning for improved bias control and data efficiency. The framework leverages the large data resources available for two-dimensional generative modelling with datasets of ligand–protein complexes, resulting in hierarchical generative models that are topologically unbiased, explainable and customizable. We show how, by deconvolving the generative posterior into chemical, topological and structural context factors, we not only avoid common pitfalls in the design and evaluation of generative models, but also gain detailed insight into the generative process itself. This improved transparency considerably aids method development and allows fine-grained control over novelty versus familiarity.
Data availability
The datasets used in this study are publicly available; for pointers, see https://github.com/capoe/libpqr/tree/master/data (repository https://doi.org/10.5281/ZENODO.6827338; ref. 58).
Code availability
The source code and pre-trained models can be accessed at https://github.com/capoe/libpqr (ref. 58).
References
Schneider, G. Automating drug discovery. Nat. Rev. Drug Discov. 17, 97–113 (2018).
Boström, J., Brown, D. G., Young, R. J. & Keserü, G. M. Expanding the medicinal chemistry synthetic toolbox. Nat. Rev. Drug Discov. 17, 709–727 (2018).
Blakemore, D. C. et al. Organic synthesis provides opportunities to transform drug discovery. Nat. Chem. 10, 383–394 (2018).
Erlanson, D. A., Fesik, S. W., Hubbard, R. E., Jahnke, W. & Jhoti, H. Twenty years on: the impact of fragments on drug discovery. Nat. Rev. Drug Discov. 15, 605–619 (2016).
Anderson, A. C. The process of structure-based drug design. Chem. Biol. 10, 787–797 (2003).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. & Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250 (2018).
Paul, D. et al. Artificial intelligence in drug discovery and development. Drug Discov. Today 26, 80–93 (2021).
Tong, X. et al. Generative models for de novo drug design. J. Med. Chem. 64, 14011–14027 (2021).
Sousa, T., Correia, J., Pereira, V. & Rocha, M. Generative deep learning for targeted compound design. J. Chem. Inf. Model. 61, 5343–5361 (2021).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminf. 9, 48 (2017).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Born, J. et al. Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2. Mach. Learn. Sci. Technol. 2, 025024 (2021).
You, J., Liu, B., Ying, R., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In NIPS 6412–6422 (2018).
Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecule optimization. In ICLR (2019).
Jin, W., Barzilay, R. & Jaakkola, T. S. Junction tree variational autoencoder for molecular graph generation. In ICML 2328–2337 (2018).
Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. CoRR abs/2001.09382 (2020).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In ICML 4839–4848 (2020).
Chen, Z., Min, M. R., Parthasarathy, S. & Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).
Joshi, R. P. et al. 3D-Scaffold: a deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds. J. Phys. Chem. B 125, 12166–12176 (2021).
Simm, G. N. C., Pinsler, R., Csányi, G. & Hernández-Lobato, J. M. Symmetry-aware actor-critic for 3D molecular design. In ICLR (2021).
Ghanbarpour, A. & Lill, M. A. Seq2mol: automatic design of de novo molecules conditioned by the target protein sequences through deep neural networks (2020); https://arxiv.org/abs/2010.15900
Skalic, M., Sabbadin, D., Sattarov, B., Sciabola, S. & De Fabritiis, G. From target to drug: generative modeling for the multimodal structure-based ligand design. Mol. Pharmaceutics 16, 4282–4291 (2019).
Xu, M., Ran, T. & Chen, H. De novo molecule design through the molecular generative model conditioned by 3D information of protein binding sites. J. Chem. Inf. Model. 61, 3240–3254 (2021).
Krishnan, S. R. et al. De novo structure-based drug design using deep learning. J. Chem. Inf. Model. (2021).
Wang, M. et al. RELATION: a deep generative model for structure-based de novo drug design. J. Med. Chem. (2022).
Zhang, J. & Chen, H. De novo molecule design using molecular generative models constrained by ligand–protein interactions. J. Chem. Inf. Model. (2022).
Imrie, F., Hadfield, T. E., Bradley, A. R. & Deane, C. M. Deep generative design with 3D pharmacophoric constraints. Chem. Sci. 12, 14577–14589 (2021).
Li, Y., Pei, J. & Lai, L. Structure-based de novo drug design using 3D deep generative models. Chem. Sci. 12, 13664–13675 (2021).
Green, H., Koes, D. R. & Durrant, J. D. Deepfrag: a deep convolutional neural network for fragment-based lead optimization. Chem. Sci. 12, 8036–8047 (2021).
Ragoza, M., Masuda, T. & Koes, D. R. Generating 3D molecules conditional on receptor binding sites with deep generative models. Chem. Sci. 13, 2701–2713 (2022).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Godinez, W. J. et al. Design of potent antimalarials with generative chemistry. Nat. Mach. Intell. 4, 180–186 (2022).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Cross, S. & Cruciani, G. Fragexplorer: grid-based fragment growing and replacement. J. Chem. Inf. Model. 62, 1224–1235 (2022).
Tan, X. et al. Discovery of pyrazolo[3,4-d]pyridazinone derivatives as selective DDR1 inhibitors via deep learning based design, synthesis, and biological evaluation. J. Med. Chem. 65, 103–119 (2022).
Piticchio, S. G. et al. Discovery of novel BRD4 ligand scaffolds by automated navigation of the fragment chemical space. J. Med. Chem. 64, 17887–17900 (2021).
Gebauer, N. W. A., Gastegger, M., Hessmann, S. S. P., Müller, K.-R. & Schütt, K. T. Inverse design of 3D molecular structures with conditional generative neural networks. Nat. Commun. 13, 973 (2022).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Schnabel, T., Swaminathan, A., Singh, A., Chandak, N. & Joachims, T. Recommendations as treatments: debiasing learning and evaluation. In ICML 1670–1679 (ICML, 2016).
Hu, L., Benson, M. L., Smith, R. D., Lerner, M. G. & Carlson, H. A. Binding MOAD (mother of all databases). Proteins 60, 333–340 (2005).
Ahmed, A., Smith, R. D., Clark, J. J., Dunbar, J. B. & Carlson, H. A. Recent improvements to Binding MOAD: a resource for protein–ligand binding affinities and structures. Nucleic Acids Res. 43, D465–D469 (2015).
Smith, R. D. et al. Updates to Binding MOAD (mother of all databases): polypharmacology tools and their utility in drug repurposing. J. Mol. Biol. 431, 2423–2433 (2019).
Wangtrakuldee, P. et al. Discovery of inhibitors of Burkholderia pseudomallei methionine aminopeptidase with antibacterial activity. ACS Med. Chem. Lett. 4, 699–703 (2013).
Helgren, T. R. et al. Rickettsia prowazekii methionine aminopeptidase as a promising target for the development of antibacterial agents. Bioorg. Med. Chem. 25, 813–824 (2017).
Zhou, C., Ma, J., Zhang, J., Zhou, J. & Yang, H. Contrastive learning for debiased candidate generation in large-scale recommender systems. In KDD 3985–3995 (2021).
Khac, P. H. L., Healy, G. & Smeaton, A. F. Contrastive representation learning: a framework and review. IEEE Access 8, 193907–193934 (2020).
You, Y. et al. Graph contrastive learning with augmentations. In NeurIPS 5812–5823 (NeurIPS, 2020).
Wang, Y., Wang, J., Cao, Z. & Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).
Landrum, G. RDKit: Open-Source Cheminformatics (2020); https://www.rdkit.org
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (ICLR, 2019).
Enamine REAL Compounds (Enamine, 2020); https://enamine.net/compound-libraries
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Shultz, M. D. Two decades under the influence of the rule of five and the changing properties of approved oral drugs: miniperspective. J. Med. Chem. 62, 1701–1714 (2019).
Gao, M. & Skolnick, J. Apoc: large-scale identification of similar protein pockets. Bioinformatics 29, 597–604 (2013).
Poelking, C. & Chan, L. libpqr_v0.3 (Zenodo, 2022); https://zenodo.org/record/6827338
Reppert, S. M. et al. Molecular characterization of a second melatonin receptor expressed in human retina and brain: the mel1b melatonin receptor. Proc. Natl. Acad. Sci. USA 92, 8734–8738 (1995).
Boivin, R. P., Luu-The, V., Lachance, R., Labrie, F. & Poirier, D. Structure–activity relationships of 17α-derivatives of estradiol as inhibitors of steroid sulfatase. J. Med. Chem. 43, 4465–4478 (2000).
Güzel, O., Innocenti, A., Scozzafava, A., Salman, A. & Supuran, C. T. Carbonic anhydrase inhibitors. Phenacetyl-, pyridylacetyl- and thienylacetyl-substituted aromatic sulfonamides act as potent and selective isoform VII inhibitors. Bioorg. Med. Chem. Lett. 19, 3170–3173 (2009).
Acknowledgements
L.C. acknowledges funding from Astex through the Sustaining Innovation Postdoctoral Program. We thank C. Murray and D. Branduardi for thoughtful comments on the manuscript, and L. Colwell for fruitful discussions.
Author information
Contributions
C.P. and M.V. conceived the project. C.P. developed the PQR formalism. L.C. and C.P. developed the code, ran the experiments, performed the data analysis and wrote the paper. R.K. contributed to data preprocessing and visualization. All authors contributed to discussions and to the preparation of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Jannis Born and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Supplementary information
Supplementary Discussion, Figs. 1–4 and Tables 1–6.
About this article
Cite this article
Chan, L., Kumar, R., Verdonk, M. et al. A multilevel generative framework with hierarchical self-contrasting for bias control and transparency in structure-based ligand design. Nat Mach Intell 4, 1130–1142 (2022). https://doi.org/10.1038/s42256-022-00564-7
DOI: https://doi.org/10.1038/s42256-022-00564-7