Abstract
Generating molecules with desired properties is an important task in chemistry and pharmacy. An efficient generation method could speed up the search for drugs to treat diseases such as COVID-19, and data mining and artificial intelligence are promising routes to such a method. Recently, both deep-learning-based generative models and genetic-algorithm-based methods have made progress in generating molecules and optimizing their properties. However, existing methods still fall short in efficiency and performance. To address these problems, we propose the Chemical Genetic Algorithm for Large Molecular Space (CALM). Specifically, CALM employs a scalable and efficient molecular representation called the molecular matrix, for which we design crossover, mutation, and mask operators inspired by domain knowledge and previous studies. We apply our genetic algorithm to several tasks in molecular property optimization and constrained molecular optimization. The results show that our approach outperforms other state-of-the-art deep learning and genetic algorithm methods; z-tests on the results of several experiments indicate that the improvement is statistically significant at a confidence level above 99%. In addition, based on the experimental results, we point out a shortcoming in the current experimental evaluation standard that affects the fair evaluation of previous work.
Acknowledgement
The authors would like to thank the reviewers for their valuable comments and Dr. Jan H. Jensen for important corrections.
Cite this article
Zhu, JF., Hao, ZK., Liu, Q. et al. Towards Exploring Large Molecular Space: An Efficient Chemical Genetic Algorithm. J. Comput. Sci. Technol. 37, 1464–1477 (2022). https://doi.org/10.1007/s11390-021-0970-3
DOI: https://doi.org/10.1007/s11390-021-0970-3