Abstract
Generative machine learning models sample molecules from chemical space without the need for explicit design rules. To enable the generative design of innovative molecular entities with limited training data, a deep learning framework for customized compound library generation is presented that aims to enrich and expand the pharmacologically relevant chemical space with drug-like molecular entities on demand. This de novo design approach combines best practices and was used to generate molecules that incorporate features of both bioactive synthetic compounds and natural products, which are a primary source of inspiration for drug discovery. The results show that the data-driven machine intelligence acquires implicit chemical knowledge and generates novel molecules with bespoke properties and structural diversity. The method is available as an open-access tool for medicinal and bioorganic chemistry.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 digital issues and online access to articles
$119.00 per year
only $9.92 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
The training data used for model building are available from the supplementary Code Ocean capsule55 at https://doi.org/10.24433/CO.0753661.v1, or as GitHub repository56 at https://github.com/ETHmodlab/virtual_libraries.
Code availability
The computational framework for generative molecular design, along with the pretrained neural network weights, is available from the supplementary Code Ocean capsule55 at https://doi.org/10.24433/CO.0753661.v1, or as a GitHub repository56 at https://github.com/ETHmodlab/virtual_libraries.
References
Walters, W. P. Virtual chemical libraries. J. Med. Chem. 62, 1116–1124 (2019).
Mullard, A. 2018 FDA drug approvals. Nat. Rev. Drug Discov. 18, 85–89 (2019).
Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18, 495 (2019).
Yuan, W. et al. Chemical space mimicry for drug discovery. J. Chem. Inf. Model. 57, 875–882 (2017).
Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C. & Aspuru-Guzik, A. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. Preprint at http://arxiv.org/abs/1705.10843 (2017).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Putin, E. et al. Reinforced adversarial neural computer for de novo molecular design. J. Chem. Inf. Model. 58, 1194–1204 (2018).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Popova, M., Shvets, M., Oliva, J. & Isayev, O. MolecularRNN: generating realistic molecular graphs with optimized properties. Preprint at https://arxiv.org/abs/1905.13372 (2019).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. Preprint at https://arxiv.org/abs/1802.04364 (2018).
You, J., Liu, B., Ying, Z., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Adv. NIPS 32, 6410–6421 (2018).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Yang, X., Wang, Y., Byrne, R., Schneider, G. & Yang, S. Concepts of artificial intelligence for computer-assisted drug discovery. Chem. Rev. 119, 10520–10594 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 61, 85–117 (2015).
Jebara, T. Machine Learning: Discriminative and Generative (Kluwer Academic, Springer, 2004).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 68 (2018).
Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inf. 37, 1700153 (2018).
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Adv. NIPS 27, 3320–3328 (2014).
Peters, M. E., Ruder, S. & Smith, N. A. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proc. 4th Workshop on Representation Learning for NLP 7–14 (RepL4NLP, 2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531–541 (2016).
Follmann, M. et al. An approach towards enhancement of a screening library: the next generation library initiative (NGLI) at Bayer—against all odds? Drug Discov. Today 24, 668–672 (2019).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2016).
Radford, A. et al. Language models are unsupervised multitask learners. Preprint at https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).
Simard, P., Victorri, B., LeCun, Y. & Denker, J. Tangent prop—a formalism for specifying selected invariances in an adaptive network. Adv. NIPS 4, 895–903 (1991).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. NIPS 25, 1097–1105 (2012).
Dao, T. et al. A kernel theory of modern data augmentation. Proc. Mach. Lern. Res. 97, 1528–1537 (2019).
Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/abs/1703.07076 (2017).
Bjerrum, E. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).
Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminformatics 11, 71 (2019).
Prykhodko, O. et al. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminformatics 11, 74 (2019).
Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inf. 37, 1700111 (2018).
Neil, D. et al. Exploring deep recurrent models with reinforcement learning for molecule design. In The Sixth International Conference on Learning Representations. Vancouver Convention Center Workshop paper (ICLR, 2018); https://iclr.cc/Conferences/2018
Awale, M., Sirockin, F., Stiefl, N. & Reymond, J. L. Drug analogs from fragment-based long short-term memory generative neural networks. J. Chem. Inf. Model. 59, 1347–1356 (2019).
Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction (International Business Machines Corporation, 1958).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
Boufridi, A. & Quinn, R. J. Harnessing the properties of natural products. Annu. Rev. Pharmacol. Toxicol. 58, 451–470 (2018).
Lovering, F., Bikker, J. & Humblet, C. Escape from flatland: increasing saturation as an approach to improving clinical success. J. Med. Chem. 52, 6752–6756 (2009).
Stratton, C. F., Newman, D. J. & Tan, D. S. Cheminformatic comparison of approved drugs from natural product versus synthetic origins. Bioorg. Med. Chem. Lett. 25, 4802–4807 (2015).
Reutlinger, M. & Schneider, G. Nonlinear dimensionality reduction and mapping of compound libraries for drug discovery. J. Mol. Graph. Model. 34, 108–117 (2012).
McInnes, L. & Healy, J. UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint at http://arxiv.org/abs/1802.03426v1 (2018).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Medina‐Franco, J. L. & Martínez‐Mayorga, K. Scaffold diversity analysis of compound data sets using an entropy‐based measure. QSAR Comb. Sci. 28, 1551–1560 (2009).
Johnson, M. A. & Maggiora, G. M. Concepts and Applications of Molecular Similarity (John Wiley & Sons, 1990).
O’Boyle, N. M. Towards a universal SMILES representation—a standard method to generate canonical SMILES based on the InChI. J. Cheminform. 4, 22 (2012).
Sander, T., Freyss, J., von Korff, M. & Rufener, C. DataWarrior: an open-source program for chemistry aware data visualization and analysis. J. Chem. Inf. Model. 55, 460–473 (2015).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at http://arxiv.org/abs/1412.6980 (2014).
Fréchet, M. Sur la distance de deux lois de probabilité. Comp. Rend. Hebdom. Séances l’Acad. Sci. 244, 689–692 (1957).
Moret M., Friedrich L., Grisoni F., Merk D. & Schneider G. Generative Molecular Design in Low Data Regimes (CodeOcean, 2020); https://doi.org/10.24433/CO.0753661.v1
Moret M., Friedrich L., Grisoni F., Merk D. & Schneider G. Generative Molecular Design in Low Data Regimes (GitHub, ETH Zurich, 2020); https://github.com/ETHmodlab/virtual_libraries
Acknowledgements
This work was financially supported by the Novartis Forschungsstiftung (FreeNovation: AI in Drug Discovery), the Swiss National Science Foundation (grant no. 205321_182176 to G.S.) and the RETHINK initiative at ETH Zurich.
Author information
Authors and Affiliations
Contributions
M.M. and L.F. contributed equally to this work. M.M. and L.F. designed the overall computational workflow. M.M. implemented the workflow and the open-access software release. L.F. performed the scaffold and descriptor analysis. All authors contributed to the study design, analysed the data and jointly wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
G.S. declares a potential financial conflict of interest as a consultant to the pharmaceutical industry and co-founder of inSili.com GmbH, Zurich. No other potential conflicts of interest are declared.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
About this article
Cite this article
Moret, M., Friedrich, L., Grisoni, F. et al. Generative molecular design in low data regimes. Nat Mach Intell 2, 171–180 (2020). https://doi.org/10.1038/s42256-020-0160-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s42256-020-0160-y
This article is cited by
-
Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR
Nature Reviews Drug Discovery (2024)
-
Macrocyclization of linear molecules by deep learning to facilitate macrocyclic drug candidates discovery
Nature Communications (2023)
-
Leveraging molecular structure and bioactivity with chemical language models for de novo drug design
Nature Communications (2023)
-
67 million natural product-like compound database generated via molecular language processing
Scientific Data (2023)
-
Artificial intelligence for natural product drug discovery
Nature Reviews Drug Discovery (2023)