Abstract
Deep learning has acquired considerable momentum in the domain of de novo drug design in recent years. Here, we propose a simple approach to the task of focused molecular generation for drug design purposes by constructing a conditional recurrent neural network (cRNN). We aggregate selected molecular descriptors and transform them into the initial memory state of the network before starting the generation of alphanumeric strings that describe molecules. We thus tackle the inverse design problem directly, as the cRNNs can generate molecules near the specified conditions. Moreover, we exemplify a novel way of assessing the focus of the conditional output of such a model using negative log-likelihood plots. The output is more focused than that of traditional unbiased RNNs, yet less focused than that of autoencoders, thus representing a novel method with intermediate output specificity between well-established methods. Conceptually, our architecture shows promise for the generalized problem of steering sequential data generation with recurrent neural networks.
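The core mechanism described above — transforming a vector of molecular descriptors into the initial memory state of the generator before sampling characters — can be sketched in miniature. The sketch below is illustrative only: the layer sizes, toy vocabulary, crude descriptor scaling and random weights are all assumptions for the sake of a self-contained example, not the trained Keras model released with the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real model is much larger. Assumed descriptor set: the six
# physicochemical properties used by the paper's physchem-based model
# (logP, TPSA, MW, QED, HBA, HBD).
N_DESC, HIDDEN, VOCAB = 6, 16, 5  # vocab stands in for SMILES characters

# Dense layer mapping descriptors to the RNN's initial states (h0, c0).
W_state = rng.normal(0, 0.1, (N_DESC, 2 * HIDDEN))
b_state = np.zeros(2 * HIDDEN)

# One untrained LSTM cell plus an output projection (random placeholder weights).
W_x = rng.normal(0, 0.1, (VOCAB, 4 * HIDDEN))
W_h = rng.normal(0, 0.1, (HIDDEN, 4 * HIDDEN))
b = np.zeros(4 * HIDDEN)
W_out = rng.normal(0, 0.1, (HIDDEN, VOCAB))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_states(descriptors):
    """Map a descriptor vector to the generator's initial (h, c) states."""
    hc = np.tanh(descriptors @ W_state + b_state)
    return hc[:HIDDEN], hc[HIDDEN:]

def lstm_step(x_onehot, h, c):
    """One LSTM step returning the next states and token logits."""
    z = x_onehot @ W_x + h @ W_h + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c, h @ W_out

# Condition on a (crudely scaled) descriptor vector, then take one
# generation step from a start token.
desc = np.array([2.5, 60.0, 350.0, 0.7, 4.0, 1.0]) / 100.0
h, c = init_states(desc)
start = np.zeros(VOCAB)
start[0] = 1.0
h, c, logits = lstm_step(start, h, c)
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape, round(float(probs.sum()), 6))
```

Because the conditions are injected only through the initial state, the generator itself remains an ordinary character-level RNN; steering amounts to choosing which point in state space generation starts from.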
Data availability
The curated datasets used to train all models are available at https://github.com/pcko1/Deep-Drug-Coder/tree/master/datasets.
Code availability
The Python code and the trained neural networks used in this work are available under the MIT licence (ref. 57) in the Deep Drug Coder (DDC) GitHub repository https://github.com/pcko1/Deep-Drug-Coder and https://doi.org/10.5281/zenodo.3739063, which also includes an optional encoding network to constitute a molecular heteroencoder.
References
Lopyrev, K. Generating news headlines with recurrent neural networks. Preprint at https://arxiv.org/pdf/1512.01712.pdf (2015).
Briot, J.-P., Hadjeres, G. & Pachet, F.-D. Deep Learning Techniques for Music Generation (Springer, 2020).
Wang, Z. et al. Chinese poetry generation with planning based neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics 1051–1060 (COLING 2016 Organizing Committee, 2016).
Elgammal, A., Liu, B., Elhoseiny, M. & Mazzone, M. CAN: Creative adversarial networks, generating ‘art’ by learning about styles and deviating from style norms. Preprint at https://arxiv.org/abs/1706.07068 (2017).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015 (eds Navab, N., Hornegger, J., Wells, W. M. & Frangi, A. F.) 234–241 (Springer, 2015).
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. & Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250 (2018).
Xu, Y. et al. Deep learning for molecular generation. Future Med. Chem. 11, 567–597 (2019).
Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design-a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
Weininger, D. SMILES, a chemical language and information system. 1: Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Schwalbe-Koda, D. & Gómez-Bombarelli, R. Generative models for automatic chemical design. Preprint at https://arxiv.org/pdf/1907.01632.pdf (2019).
Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/pdf/1506.00019.pdf (2015).
Arús-Pous, J. et al. Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20 (2019).
Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Polykovskiy, D., Artamonov, A., Veselov, M., Kadurin, A. & Nikolenko, S. Molecular Sets (MOSES): a benchmarking platform for molecular generation models. Preprint at https://arxiv.org/pdf/1811.12823.pdf (2019).
Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/pdf/1703.07076.pdf (2017).
Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).
Winter, R., Montanari, F., Noé, F. & Clevert, D. A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).
Blaschke, T., Olivecrona, M., Engkvist, O., Bajorath, J. & Chen, H. Application of generative autoencoder in de novo molecular design. Mol. Inform. 37, 1–11 (2018).
Winter, R. et al. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci. 10, 8016–8024 (2019).
Prykhodko, O. et al. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform. 11, 74 (2019).
Lim, J., Ryu, S., Kim, J. W. & Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminform. 10, 31 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. S. Multi-resolution autoregressive graph-to-graph translation for molecules. Preprint at https://chemrxiv.org/articles/Multi-Resolution_Autoregressive_Graph-to-Graph_Translation_for_Molecules/8266745/1 (2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems Vol. 28 (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.) 2224–2232 (Curran Associates, 2015).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining 226–231 (AAAI Press, 1996).
Škrlj, B., Džeroski, S., Lavrač, N. & Petkovič, M. Feature importance estimation with self-attention networks. Preprint at https://arxiv.org/pdf/2002.04464.pdf (2020).
Olden, J. D., Joy, M. K. & Death, R. G. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Model. 178, 389–397 (2004).
Hung, L. & Chung, H. Decoupled control using neural network-based sliding-mode controller for nonlinear systems. Expert Syst. Appl. 32, 1168–1182 (2007).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2017).
Sun, J. et al. ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. J. Cheminform. 9, 41 (2017).
Swain, M. MolVS: Molecule Validation and Standardization v0.1.1 (2019); https://molvs.readthedocs.io/en/latest/
Sun, J. et al. ExCAPEDB (2019); https://solr.ideaconsult.net/search/excape/
Landrum, G. et al. RDKit: Open-Source Cheminformatics Software (2019); https://www.rdkit.org/
Butina, D. Unsupervised data base clustering based on daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J. Chem. Inf. Comput. Sci. 39, 747–750 (1999).
Bjerrum, E. J. Molvecgen: Molecular Vectorization and Batch Generation (2019); https://github.com/EBjerrum/molvecgen
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).
Probst, D. & Reymond, J. L. A probabilistic molecular fingerprint for big data settings. J. Cheminform. 10, 66 (2018).
Chollet, F. Keras (2019); https://keras.io/
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/pdf/1603.04467.pdf (2016).
Arora, R., Basu, A., Mianjy, P. & Mukherjee, A. Understanding deep neural networks with rectified linear units. Preprint at https://arxiv.org/pdf/1611.01491.pdf (2016).
Williams, R. J. & Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1, 270–280 (1989).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015) (eds Bengio, Y. & LeCun, Y.) (2015).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning — ICANN 2018 (eds Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L. & Maglogiannis, I.) 270–279 (Springer, 2018).
MIT Licence; https://opensource.org/licenses/MIT
Acknowledgements
We thank the entire MolecularAI team at AstraZeneca for their invaluable input and the fruitful discussions held during development of the present work. J.A.-P. is supported financially by the European Union’s Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie grant (agreement no. 676434, ‘Big Data in Chemistry’, ‘BIGCHEM’; http://bigchem.eu).
Author information
Contributions
P.-C.K. and E.J.B. planned the project and jointly performed analysis of the results. P.-C.K. developed the necessary code. E.J.B. supervised the overall project. J.A.-P. assisted with the preprocessing of the datasets. J.A.-P., H.C., O.E. and C.T. provided valuable feedback on the methods used, the experimental set-up and the results at every stage. P.-C.K. wrote the manuscript and all authors reviewed it.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Distribution of physicochemical properties of datasets.
a, Wildman–Crippen partition coefficient (logP), b, topological polar surface area (TPSA), c, molecular weight (MW), d, drug-likeness score (QED), e, number of hydrogen bond acceptors (HBA) and f, hydrogen bond donors (HBD) with respect to the complete CHEMBL25 and DRD2 datasets before splitting. Subfigures a–d show the continuous histogram density as estimated by the kdeplot method of the seaborn Python library using default parameters.
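The continuous densities in subfigures a–d are kernel density estimates over the property values. As a minimal illustration of the estimator itself (seaborn's kdeplot wraps a similar Gaussian KDE; Scott's rule for the bandwidth and the toy molecular-weight values below are assumptions, not the paper's data):

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth=None):
    """Plain Gaussian kernel density estimate evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb (an assumption; seaborn chooses its own default)
        bandwidth = samples.std(ddof=1) * n ** (-1.0 / 5.0)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

# Toy molecular-weight samples in Da (hypothetical values, not the datasets).
mw = [310.0, 325.0, 350.0, 360.0, 372.0, 401.0]
grid = np.linspace(250.0, 450.0, 201)
density = gaussian_kde(mw, grid)

# A valid density should integrate to roughly 1 over a wide enough grid.
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 2))
```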
Extended Data Fig. 2 Tanimoto similarity and predicted activity of generated structures.
a, Distribution of pairwise Tanimoto similarity of uniquely generated Murcko scaffolds to the seeding Murcko scaffold. The physchem-based (PCB) model generates SMILES that correspond to new scaffolds, whereas the fingerprint-based (FPB) model generates scaffolds that are more similar or even identical to the seeding scaffold. b, Predicted activity probability of all unique structures underlying the generated SMILES strings, per model. Both models generate SMILES that are predicted to be active, with similar probability distributions.
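The pairwise similarities in panel a are Tanimoto similarities between scaffold fingerprints. A minimal sketch of the metric itself follows (the paper's pipeline derives Murcko scaffolds and fingerprints with RDKit; the bit sets below are hypothetical stand-ins for fingerprint on-bits):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Toy on-bit sets standing in for fingerprints of a seed scaffold and a
# generated scaffold (hypothetical bit indices).
fp_seed = {3, 17, 42, 88, 129}
fp_gen = {3, 17, 42, 200}
print(tanimoto(fp_seed, fp_gen))  # 3 shared bits / 6 distinct bits = 0.5
```

A similarity of 1.0 corresponds to the FPB model reproducing the seeding scaffold exactly, while values near 0 correspond to the new scaffolds typical of the PCB model.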
Extended Data Fig. 3 Novelty of uniquely generated underlying molecules with respect to different datasets.
Novelty is assessed with respect to the train and test ChEMBL datasets using the physchem-based (PCB) and fingerprint-based (FPB) models. The first element of every pair on the x-axis corresponds to the dataset the conditions were drawn from; the second element represents the dataset with respect to which novelty was calculated. For either model, the difference between datasets is negligible, reflecting consistent generation of novel compounds regardless of the seeding conditions. The numbers correspond to the fraction of valid unique novel molecules out of 25,600 generated SMILES strings.
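The novelty fraction reported above reduces to set membership over canonical structures: a generated molecule is novel if its canonical form does not appear in the reference dataset. A minimal sketch, assuming the SMILES strings are already canonicalized (in the paper this canonicalization is done with RDKit; the toy strings below are illustrative):

```python
def novelty_fraction(generated, reference):
    """Fraction of unique generated molecules absent from a reference set.

    Both inputs are assumed to hold canonical SMILES, so that string
    equality implies molecular identity.
    """
    unique_gen = set(generated)
    if not unique_gen:
        return 0.0
    novel = unique_gen - set(reference)
    return len(novel) / len(unique_gen)

# Toy canonical SMILES (illustrative only).
train = ["CCO", "c1ccccc1", "CC(=O)O"]
sampled = ["CCO", "CCN", "CCN", "CCCl"]  # 3 unique molecules, 2 of them novel
print(novelty_fraction(sampled, train))
```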
Extended Data Fig. 4 Optimization of properties individually in every direction with the physchem-based model.
The pattern of the molecular properties of the generated valid SMILES (blue dots) follows the set conditions (red lines). The length of each step represents the number of valid SMILES for that setpoint out of 256 sampled SMILES strings. Low molecular weight or high QED setpoints lead to unstable generation of valid SMILES for the given condition. QED displays the largest deviations from the seed conditions and is the hardest property to control, as its formula contains a weighted sum of the other five properties. The area annotated by arrows refers to an input combination with a high QED target that caused the output to collapse with respect to the rate of valid SMILES and the fulfilment of the specified conditions. The exact percentage of unique molecules stemming from all valid SMILES sampled at each step is shown in Supplementary Fig. 12.
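The sweep described above can be framed as a loop: for each property setpoint, sample a fixed batch of SMILES and record the fraction that are valid. The schematic below uses a stub sampler in place of the trained cRNN and a stub validity check in place of RDKit parsing; both stubs and their names are hypothetical, and only the shape of the experiment is faithful to the caption.

```python
import random

def sample_smiles(setpoint, n=256, seed=0):
    """Stub generator: stands in for seeding the cRNN with a property
    setpoint and sampling n SMILES strings (hypothetical)."""
    rng = random.Random(seed + int(setpoint))
    # A mix of well-formed strings and one deliberately broken string.
    pool = ["CCO", "c1ccccc1", "CC(=O)O", "C1CC("]
    return [rng.choice(pool) for _ in range(n)]

def is_valid(smiles):
    """Stub validity check: balanced branch tokens only. The paper uses
    full RDKit sanitization instead."""
    return smiles.count("(") == smiles.count(")")

def validity_sweep(setpoints, n=256):
    """Fraction of valid SMILES per setpoint, as plotted per step."""
    return {sp: sum(is_valid(s) for s in sample_smiles(sp, n)) / n
            for sp in setpoints}

# Hypothetical molecular-weight setpoints in Da.
rates = validity_sweep([200, 300, 400, 500])
print(rates)
```

In the real experiment, a collapse such as the high-QED region annotated in the figure would show up here as a setpoint whose validity rate drops sharply relative to its neighbours.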
Supplementary information
Supplementary Information
Likelihood of sampling of canonical SMILES and Figs. 1–12.
Cite this article
Kotsias, P.-C., Arús-Pous, J., Chen, H. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020). https://doi.org/10.1038/s42256-020-0174-5