Abstract
This work proposes a data-assisted approach to drug molecule design. The approach uses a generation-based framework to expedite drug discovery, aiming to identify candidate molecules suitable for production while minimizing development timelines and regulatory hurdles. At its core is a conditional variational autoencoder (CVAE) that generates molecules in the NCSMILES string representation. The framework has three stages: (1) molecule generation using the CVAE, (2) filtering based on a scoring function, and (3) identification of the optimal molecule from the generated pool. To enhance the latent space representation, we incorporate molecule properties alongside conditional selection criteria. The proposed scheme is evaluated on standard benchmark datasets using metrics that include validity, diversity, usefulness, and novelty. The method demonstrates superior performance compared to existing state-of-the-art approaches, which is attributed to several improvements, including intermediary optimizations and condition-based selection.
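The three stages and two of the evaluation metrics can be illustrated with a minimal Python sketch. It is not the implementation described in this article: the candidate strings stand in for CVAE decoder samples, and `is_plausible`, `toy_score`, `select_best`, `validity`, and `novelty` are hypothetical names introduced only for illustration. In particular, the validity check is a toy parenthesis-balance test, whereas a real check would parse each string with a cheminformatics toolkit, and the metric definitions shown are common conventions assumed here rather than taken from the article.

```python
from typing import Callable, Iterable, List, Optional, Set


def is_plausible(s: str) -> bool:
    """Toy validity check (assumption): non-empty with balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(s) and depth == 0


def toy_score(s: str) -> float:
    """Hypothetical scoring function: prefers strings of length 7."""
    return -float(abs(len(s) - 7))


def select_best(
    candidates: Iterable[str],
    score: Callable[[str], float],
    is_valid: Callable[[str], bool],
    threshold: float,
) -> Optional[str]:
    """Stages 2 and 3: filter by validity and score, then pick the top candidate."""
    unique_valid = [c for c in dict.fromkeys(candidates) if is_valid(c)]
    kept = [c for c in unique_valid if score(c) >= threshold]
    return max(kept, key=score) if kept else None


def validity(candidates: List[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated strings that pass the validity check."""
    return sum(is_valid(c) for c in candidates) / len(candidates)


def novelty(
    candidates: List[str], is_valid: Callable[[str], bool], training: Set[str]
) -> float:
    """Fraction of unique valid strings that do not appear in the training set."""
    unique_valid = {c for c in candidates if is_valid(c)}
    if not unique_valid:
        return 0.0
    return len(unique_valid - training) / len(unique_valid)


# Stage 1 stand-in: a fixed list in place of samples from a trained CVAE decoder.
generated = ["CCO", "CC(=O)O", "C(C", "CCO", "CCN(C)CC", "))"]
best = select_best(generated, toy_score, is_plausible, threshold=-2.0)
print(best)  # CC(=O)O
print(validity(generated, is_plausible))
print(novelty(generated, is_plausible, training={"CCO"}))
```

Passing the scoring and validity functions as arguments keeps the selection step independent of any particular property predictor, so a different scoring function can be substituted without changing the pipeline.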











Data availability
The data that support the findings of this study are openly available at https://github.com/arunsinghbhadwal/NRC-VABS.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Research involving human participants and/or animals
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
About this article
Cite this article
Bhadwal, A.S., Kumar, K. Nc-vae: normalised conditional diverse variational autoencoder guided de novo molecule generation. J Supercomput 80, 21207–21228 (2024). https://doi.org/10.1007/s11227-024-06250-2
DOI: https://doi.org/10.1007/s11227-024-06250-2