Generative molecular design in low data regimes

Abstract

Generative machine learning models sample molecules from chemical space without the need for explicit design rules. To enable the generative design of innovative molecular entities with limited training data, a deep learning framework for customized compound library generation is presented that aims to enrich and expand the pharmacologically relevant chemical space with drug-like molecular entities on demand. This de novo design approach combines best practices and was used to generate molecules that incorporate features of both bioactive synthetic compounds and natural products, which are a primary source of inspiration for drug discovery. The results show that the data-driven machine intelligence acquires implicit chemical knowledge and generates novel molecules with bespoke properties and structural diversity. The method is available as an open-access tool for medicinal and bioorganic chemistry.

Fig. 1: CLM training and sampling of new molecules.
Fig. 2: Data augmentation and temperature sampling.
Fig. 3: Chemical space navigation by transfer learning with five similar molecules.
Fig. 4: The five most frequent scaffolds from different training epochs during chemical space navigation to de novo-generated focused compound libraries.
Fig. 5: Chemical space navigation by transfer learning with five dissimilar molecules.
Fig. 6: The five most frequent scaffolds from different training epochs during chemical space navigation.
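
The temperature sampling named in the caption of Fig. 2 can be illustrated with a minimal sketch. It assumes the common formulation in which a model's output scores (logits) over the next token are divided by a temperature T before the softmax, so that low T concentrates probability on the most likely token and high T flattens the distribution. The function name, the three-token example and the logit values are hypothetical and are not taken from the authors' released code.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample one index from softmax(logits / temperature).

    Illustrative only: `logits` stands in for the scores a chemical
    language model would assign to each candidate next token.
    Returns the sampled index and the full probability vector.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
    for t in (0.1, 1.0, 10.0):
        _, p = sample_with_temperature(toy_logits, temperature=t)
        print(t, [round(v, 3) for v in p])
```

Running the script prints the token probabilities at three temperatures, which shows the trade-off between conservative sampling (low T) and more diverse sampling (high T).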

Data availability

The training data used for model building are available from the supplementary Code Ocean capsule (ref. 55) at https://doi.org/10.24433/CO.0753661.v1, or as a GitHub repository (ref. 56) at https://github.com/ETHmodlab/virtual_libraries.

Code availability

The computational framework for generative molecular design, along with the pretrained neural network weights, is available from the supplementary Code Ocean capsule (ref. 55) at https://doi.org/10.24433/CO.0753661.v1, or as a GitHub repository (ref. 56) at https://github.com/ETHmodlab/virtual_libraries.

References

  1. Walters, W. P. Virtual chemical libraries. J. Med. Chem. 62, 1116–1124 (2019).

  2. Mullard, A. 2018 FDA drug approvals. Nat. Rev. Drug Discov. 18, 85–89 (2019).

  3. Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18, 495 (2019).

  4. Yuan, W. et al. Chemical space mimicry for drug discovery. J. Chem. Inf. Model. 57, 875–882 (2017).

  5. Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C. & Aspuru-Guzik, A. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. Preprint at http://arxiv.org/abs/1705.10843 (2017).

  6. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).

  7. Putin, E. et al. Reinforced adversarial neural computer for de novo molecular design. J. Chem. Inf. Model. 58, 1194–1204 (2018).

  8. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  9. Popova, M., Shvets, M., Oliva, J. & Isayev, O. MolecularRNN: generating realistic molecular graphs with optimized properties. Preprint at https://arxiv.org/abs/1905.13372 (2019).

  10. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. Preprint at https://arxiv.org/abs/1802.04364 (2018).

  11. You, J., Liu, B., Ying, Z., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Adv. NIPS 32, 6410–6421 (2018).

  12. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).

  13. Yang, X., Wang, Y., Byrne, R., Schneider, G. & Yang, S. Concepts of artificial intelligence for computer-assisted drug discovery. Chem. Rev. 119, 10520–10594 (2019).

  14. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  15. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 61, 85–117 (2015).

  16. Jebara, T. Machine Learning: Discriminative and Generative (Kluwer Academic, Springer, 2004).

  17. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).

  18. Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).

  19. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

  20. Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 68 (2018).

  21. Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inf. 37, 1700153 (2018).

  22. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Adv. NIPS 27, 3320–3328 (2014).

  23. Peters, M. E., Ruder, S. & Smith, N. A. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proc. 4th Workshop on Representation Learning for NLP 7–14 (RepL4NLP, 2019).

  24. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  25. Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531–541 (2016).

  26. Follmann, M. et al. An approach towards enhancement of a screening library: the next generation library initiative (NGLI) at Bayer—against all odds? Drug Discov. Today 24, 668–672 (2019).

  27. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2016).

  28. Radford, A. et al. Language models are unsupervised multitask learners. Preprint at https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).

  29. Simard, P., Victorri, B., LeCun, Y. & Denker, J. Tangent prop—a formalism for specifying selected invariances in an adaptive network. Adv. NIPS 4, 895–903 (1991).

  30. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. NIPS 25, 1097–1105 (2012).

  31. Dao, T. et al. A kernel theory of modern data augmentation. Proc. Mach. Learn. Res. 97, 1528–1537 (2019).

  32. Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/abs/1703.07076 (2017).

  33. Bjerrum, E. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).

  34. Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).

  35. Prykhodko, O. et al. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform. 11, 74 (2019).

  36. Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inf. 37, 1700111 (2018).

  37. Neil, D. et al. Exploring deep recurrent models with reinforcement learning for molecule design. In The Sixth International Conference on Learning Representations. Vancouver Convention Center Workshop paper (ICLR, 2018); https://iclr.cc/Conferences/2018

  38. Awale, M., Sirockin, F., Stiefl, N. & Reymond, J. L. Drug analogs from fragment-based long short-term memory generative neural networks. J. Chem. Inf. Model. 59, 1347–1356 (2019).

  39. Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction (International Business Machines Corporation, 1958).

  40. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  41. Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).

  42. Boufridi, A. & Quinn, R. J. Harnessing the properties of natural products. Annu. Rev. Pharmacol. Toxicol. 58, 451–470 (2018).

  43. Lovering, F., Bikker, J. & Humblet, C. Escape from flatland: increasing saturation as an approach to improving clinical success. J. Med. Chem. 52, 6752–6756 (2009).

  44. Stratton, C. F., Newman, D. J. & Tan, D. S. Cheminformatic comparison of approved drugs from natural product versus synthetic origins. Bioorg. Med. Chem. Lett. 25, 4802–4807 (2015).

  45. Reutlinger, M. & Schneider, G. Nonlinear dimensionality reduction and mapping of compound libraries for drug discovery. J. Mol. Graph. Model. 34, 108–117 (2012).

  46. McInnes, L. & Healy, J. UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint at http://arxiv.org/abs/1802.03426v1 (2018).

  47. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).

  48. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).

  49. Medina‐Franco, J. L. & Martínez‐Mayorga, K. Scaffold diversity analysis of compound data sets using an entropy‐based measure. QSAR Comb. Sci. 28, 1551–1560 (2009).

  50. Johnson, M. A. & Maggiora, G. M. Concepts and Applications of Molecular Similarity (John Wiley & Sons, 1990).

  51. O’Boyle, N. M. Towards a universal SMILES representation—a standard method to generate canonical SMILES based on the InChI. J. Cheminform. 4, 22 (2012).

  52. Sander, T., Freyss, J., von Korff, M. & Rufener, C. DataWarrior: an open-source program for chemistry aware data visualization and analysis. J. Chem. Inf. Model. 55, 460–473 (2015).

  53. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at http://arxiv.org/abs/1412.6980 (2014).

  54. Fréchet, M. Sur la distance de deux lois de probabilité. Comp. Rend. Hebdom. Séances l’Acad. Sci. 244, 689–692 (1957).

  55. Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative Molecular Design in Low Data Regimes (CodeOcean, 2020); https://doi.org/10.24433/CO.0753661.v1

  56. Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative Molecular Design in Low Data Regimes (GitHub, ETH Zurich, 2020); https://github.com/ETHmodlab/virtual_libraries

Acknowledgements

This work was financially supported by the Novartis Forschungsstiftung (FreeNovation: AI in Drug Discovery), the Swiss National Science Foundation (grant no. 205321_182176 to G.S.) and the RETHINK initiative at ETH Zurich.

Author information

Contributions

M.M. and L.F. contributed equally to this work. M.M. and L.F. designed the overall computational workflow. M.M. implemented the workflow and the open-access software release. L.F. performed the scaffold and descriptor analysis. All authors contributed to the study design, analysed the data and jointly wrote the manuscript.

Corresponding author

Correspondence to Gisbert Schneider.

Ethics declarations

Competing interests

G.S. declares a potential financial conflict of interest as a consultant to the pharmaceutical industry and co-founder of inSili.com GmbH, Zurich. No other potential conflicts of interest are declared.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Moret, M., Friedrich, L., Grisoni, F. et al. Generative molecular design in low data regimes. Nat Mach Intell 2, 171–180 (2020). https://doi.org/10.1038/s42256-020-0160-y

  • DOI: https://doi.org/10.1038/s42256-020-0160-y
