1 Introduction

Natural products derived from plants and microorganisms have attracted attention for their beneficial properties and diverse biological activities [1, 2]. These compounds are known for their complex structures and large molecular weights. Because they are biosynthesized within living organisms and the structures have been evolutionarily optimized over millions of years to perform specific biological functions, many of them display potent biological activities and are often used as lead compounds in drug development. Between 1981 and 2002, natural products accounted for over 60% and 75% of the new chemical entities (NCEs) developed for cancer and infectious diseases, respectively [3]. In addition, approximately half of the drugs currently available on the market are derived from natural products [4], highlighting their vital role in drug discovery and development. Although the proportion of natural products in novel drugs is decreasing in the pharmaceutical industry today, the influence of the structure of natural products still cannot be ignored [5].

The unique molecular structures of natural products, which are rarely found in synthetic compounds, contribute to their biological activity [1]. The golden age of natural product drug discovery began in the 1940s with the discovery of penicillin. Many drugs were discovered from microbes, especially actinomycetes and fungi, until the early 1970s. However, from the late 1980s to the early 1990s, new drug discoveries from natural products declined [6]. Pharmaceutical companies began to withdraw from natural product research due to the emergence of combinatorial chemistry and high-throughput screening (HTS), which allowed artificial creation of chemical diversity. Furthermore, the complexity of the structures of natural products made synthesis and derivatization difficult, complicating lead compound optimization [7, 8]. Despite these challenges, natural products have recently been reassessed and are once again gaining attention as valuable resources in drug discovery due to their diverse structures and biological activities [6].

In recent years, advances in deep learning-based molecular generation have been applied to the discovery of novel pharmaceuticals [9]. This approach involves the virtual generation of compounds on computers, with the aim of identifying useful candidate molecules. However, because the training process typically utilizes general chemical databases comprising relatively small molecules, such as PubChem [10], it is challenging to generate large and complex compounds similar to natural products. Consequently, this limitation narrows the chemical space that can be explored [11].

In this study, we propose a molecular generative model capable of producing natural product-like compounds. By generating a group of molecules using a model that has learned the distribution of natural products, we aim to facilitate the search for lead molecules in drug discovery and reduce the costs of natural product-based drug development.

A previous study closely related to this work is that of Tay et al., who used a recurrent neural network (RNN) to generate natural products [12]. They trained an RNN equipped with LSTM units on natural products from the COCONUT database [13] and developed a model capable of generating compounds similar to natural products. They showed that the distribution of natural product-likeness scores of the generated compounds was similar to that of the natural products in COCONUT. This study aims to build a higher-performing model based on the Transformer architecture than the approach of Tay et al., and further evaluates whether the generated library is useful as a source of pharmaceutical candidates. Figure 1 summarizes the method of this paper.

Fig. 1 Overview of the proposed method

2 Methods

2.1 Fine-tuning and chemical language models

Fine-tuning is a technique that adapts language models, initially trained on extensive datasets, to excel at particular tasks, tailoring them to specialized requirements. In this study, we fine-tuned chemical language models on a natural product dataset. A chemical language model is a model that processes string representations of molecules, e.g., the simplified molecular-input line-entry system (SMILES) [14] and self-referencing embedded strings (SELFIES) [15]. Examples of these string representations are shown in Fig. 2. We hypothesized that, because pretrained models have already learned chemical structures, we could efficiently construct a model capable of generating natural product-like compounds.
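As a concrete illustration of the two representations, the following minimal sketch (not part of the original study; the example molecule and package calls are our own assumptions) converts a molecule between SMILES and SELFIES using RDKit and the selfies package.

```python
# Minimal sketch: SMILES vs. SELFIES for the same molecule.
# Assumes RDKit and the `selfies` package are installed; caffeine is used
# here only as an illustrative molecule (the paper's Fig. 2 uses cochliodinol).
from rdkit import Chem
import selfies as sf

smiles = "Cn1cnc2c1c(=O)n(C)c(=O)n2C"          # caffeine

mol = Chem.MolFromSmiles(smiles)               # parse and validate
canonical_smiles = Chem.MolToSmiles(mol)       # canonical SMILES via RDKit

selfies_string = sf.encoder(canonical_smiles)  # SMILES -> SELFIES
roundtrip_smiles = sf.decoder(selfies_string)  # SELFIES -> SMILES (always valid)

print(canonical_smiles)
print(selfies_string)
print(roundtrip_smiles)
```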

Fig. 2 Examples of SMILES and SELFIES encoding (cochliodinol, a natural product compound)

2.2 Dataset

We used the COCONUT database, which contains approximately 400,000 natural products [13]. As a preprocessing step, we standardized the SMILES strings and removed large compounds (more than 150 atoms or more than 10 rings). Our preprocessing largely follows the data preparation procedure of ReinventCommunity [16]. MolVS [17] was used for SMILES standardization, which includes removal of explicit hydrogens and normalization of functional groups; this produced a consistent SMILES dataset and was expected to increase the validity of the generated molecules. In the filtering step, ReinventCommunity removes compounds with more than 70 atoms, but we raised this threshold to 150 because we are dealing with natural products of relatively high molecular weight. Subsequently, we employed a technique that enumerates SMILES by randomizing the traversal order of the molecular graph [18], augmenting the data approximately ninefold. The final dataset contained approximately 3.6 million entries and was used for fine-tuning. Some compounds in COCONUT appear to be nonnatural, but their proportion is small, so we believe they have a negligible impact on fine-tuning the pretrained models.
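A minimal sketch of this preprocessing is shown below; the exact MolVS and RDKit calls are our own assumptions, since the authors' actual script follows ReinventCommunity [16] and may differ in detail.

```python
# Sketch of the preprocessing pipeline: MolVS standardization, size/ring
# filtering, and randomized-SMILES augmentation with RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from molvs import Standardizer

standardizer = Standardizer()

def preprocess(smiles, max_atoms=150, max_rings=10, n_random=9):
    """Return a list of SMILES (canonical + randomized) or [] if filtered out."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    mol = standardizer.standardize(mol)                       # normalize functional groups etc.
    if (mol.GetNumAtoms() > max_atoms
            or rdMolDescriptors.CalcNumRings(mol) > max_rings):
        return []                                             # drop very large / highly cyclic molecules
    enumerated = [Chem.MolToSmiles(mol)]                      # canonical form
    enumerated += [Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                   for _ in range(n_random)]                  # ~ninefold augmentation
    return enumerated

print(preprocess("CC(=O)Oc1ccccc1C(=O)O"))                    # aspirin, illustrative input
```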

2.3 Models

We selected pretrained models that satisfy the following criteria:

  • It has been trained on a dataset of significant size.

  • It is a decoder-only model, i.e., it uses only the decoder of the Transformer architecture. The generative pretrained transformer (GPT) [19] is an example of such a model, and GPT-based models are widely used for generative tasks due to their high performance.

We selected two models, smiles-gpt [20] and ChemGPT [21]. The details of the models are shown in Table 1. Both models were pretrained on the PubChem-10 M dataset [22] (smiles-gpt used the first 5 million molecules of PubChem-10 M), and their architectures are based on GPT. The models differ mainly in the molecular string representation used: smiles-gpt employs SMILES, whereas ChemGPT uses SELFIES. We used both models to investigate which model characteristics are suitable for our method.

Table 1 Pretrained models used in this study

2.4 Training

We fine-tuned the models on the natural product dataset using the AdamW optimizer [25]. The learning rate was decayed from \(5.0 \times 10^{-4}\) to \(5.0 \times 10^{-8}\) with a cosine annealing schedule for smiles-gpt and was set to \(5.0 \times 10^{-5}\) for ChemGPT. The batch size was set to 256 for smiles-gpt and 32 for ChemGPT; because SELFIES strings are long and occupy more memory, the ChemGPT batch size had to be reduced to fit within GPU memory. Training was conducted on four GeForce RTX 3090 GPUs.
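The sketch below illustrates how such fine-tuning could be set up with Hugging Face Transformers; the checkpoint identifier, toy dataset, and epoch count are placeholders rather than the authors' exact configuration (the Trainer uses AdamW by default, and a cosine learning-rate schedule is requested explicitly).

```python
# Minimal fine-tuning sketch (assumptions: placeholder checkpoint path, toy data).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "path/to/pretrained-chemical-gpt"   # placeholder for smiles-gpt / ChemGPT weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token    # GPT-style tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Toy molecules standing in for the ~3.6 million preprocessed COCONUT strings.
np_strings = ["CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1"]
train_dataset = Dataset.from_dict({"text": np_strings}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="np-finetuned",
    per_device_train_batch_size=64,   # 4 GPUs x 64 = 256 (smiles-gpt); 32 in total for ChemGPT
    learning_rate=5e-4,               # initial learning rate for smiles-gpt
    lr_scheduler_type="cosine",       # cosine annealing schedule
    num_train_epochs=1,               # illustrative value
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()

# After training, molecules can be sampled autoregressively, e.g.:
# outputs = model.generate(do_sample=True, max_length=256, num_return_sequences=8,
#                          pad_token_id=tokenizer.pad_token_id)
# smiles = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```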

3 Results and discussion

3.1 Evaluation of generated molecules

We calculated validity, uniqueness, novelty, internal diversity [26], and Fréchet ChemNet Distance (FCD) [27] for the 100 million molecules generated and made public in a previous study [12], as well as for the 100 million molecules generated by fine-tuned ChemGPT and smiles-gpt.

  • Validity: The ratio of valid molecules to the total number of generated molecules. A valid molecule is one that can be parsed by RDKit [28].

  • Uniqueness: The ratio of unique molecules to the total number of generated molecules.

  • Novelty: The ratio of generated molecules that do not exist in the COCONUT database.

  • Internal diversity: The average pairwise Tanimoto similarity between the generated molecules, calculated using Morgan fingerprints with a radius of 2 and 1024 bits. This metric was calculated using MOSES [26].

  • FCD: A metric of the distance between the distribution of generated molecules and that of the training dataset. A smaller FCD indicates that the set of generated molecules is closer to the training data distribution.
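A minimal sketch of how the first three metrics can be computed with RDKit is given below (our own code, not the paper's evaluation script; internal diversity and FCD were obtained with the MOSES [26] and FCD [27] packages, and the denominators here are illustrative choices following the definitions above).

```python
# Sketch: validity, uniqueness, and novelty of a generated SMILES set.
from rdkit import Chem

def basic_metrics(generated_smiles, reference_smiles):
    # Validity: fraction of generated strings that RDKit can parse.
    mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]   # canonical SMILES
    validity = len(valid) / len(generated_smiles)

    # Uniqueness: fraction of distinct canonical SMILES among the generated set.
    unique = set(valid)
    uniqueness = len(unique) / len(generated_smiles)

    # Novelty: fraction of generated molecules not present in the reference (COCONUT).
    reference = {Chem.MolToSmiles(m) for m in map(Chem.MolFromSmiles, reference_smiles)
                 if m is not None}
    novelty = len(unique - reference) / len(generated_smiles)
    return validity, uniqueness, novelty

print(basic_metrics(["CCO", "CCO", "not_a_smiles"], ["CCO"]))   # toy example
```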

The results are shown in Table 2. Smiles-gpt achieved results close to those of the previous study [12]. Compared to Tay et al., its smaller FCD suggests that more compounds similar to natural products were generated, i.e., that sampling occurred from a narrower chemical space better adapted to that of natural products. In this respect, it generated compounds more closely resembling natural products than the previous study did.

ChemGPT exhibited high validity, which is believed to be due to the use of SELFIES. However, the significantly large FCD indicates that the distribution of natural products was not captured accurately. Although high uniqueness and novelty are numerically positive outcomes, the magnitude of FCD suggests sampling from a broader chemical space, resulting in the generation of compounds that appear to be nearly random.

We also measured the FCD between the molecular sets generated by each model before and after fine-tuning: 29.01 for ChemGPT and 6.75 for smiles-gpt. This indicates that the distribution of molecules generated by ChemGPT changed more strongly through fine-tuning than that of smiles-gpt. Thus, although Table 2 suggests that ChemGPT did not learn the distribution of COCONUT, the distribution of its generated molecules did at least change markedly from that of the pretrained model.

Table 2 Validity, uniqueness, novelty, internal diversity, and FCD of each generated set of 100 million molecules

3.2 Visualization of generated molecules in physicochemical descriptor space

We visualized the distribution of molecules generated by the original and fine-tuned models, along with COCONUT compounds, using t-distributed stochastic neighbor embedding (t-SNE). We randomly selected 2,000 of the generated molecules and embedded them in two dimensions using t-SNE based on 209 physicochemical descriptors per molecule. The descriptors were calculated with Descriptors.CalcMolDescriptors from RDKit. The visualization results are shown in Figs. 3 and 4.
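The sketch below shows how such an embedding can be computed (our own code; the descriptor standardization step and t-SNE settings are assumptions, not the paper's exact procedure).

```python
# Sketch: RDKit descriptor vectors followed by a 2-D t-SNE embedding.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def descriptor_matrix(smiles_list):
    """One row of physicochemical descriptors per parseable molecule."""
    rows = []
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            continue
        desc = Descriptors.CalcMolDescriptors(mol)          # dict of ~200 RDKit descriptors
        rows.append([desc[name] for name in sorted(desc)])
    return np.nan_to_num(np.asarray(rows, dtype=float))

def tsne_embedding(smiles_list, random_state=0):
    """Two-dimensional t-SNE embedding of the descriptor space."""
    X = StandardScaler().fit_transform(descriptor_matrix(smiles_list))
    return TSNE(n_components=2, random_state=random_state).fit_transform(X)
```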

From the smiles-gpt results in Fig. 3, it appears that the overall distribution of the molecules has moved closer to COCONUT through fine-tuning. In contrast, as shown in Fig. 4, ChemGPT still exhibits a different distribution from COCONUT even after fine-tuning.

Fig. 3 t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of smiles-gpt, along with molecules from COCONUT

Fig. 4 t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of ChemGPT, along with molecules from COCONUT

3.3 Distribution of scores for generated molecules

We calculated the natural product-likeness score (NP Score) [29] and the synthetic accessibility score (SA Score) [30] for molecules generated by the original and fine-tuned models, as well as for molecules generated in the previous research by Tay et al., and compared their distributions with those of the natural product data. Kernel density estimation was performed on the NP Score and SA Score values of each molecular library, and the results are shown in Figs. 5 and 6.

The NP Score is an index that measures the natural product-likeness of a compound, calculated based on the frequency of occurrence of substructures in natural products. The SA Score is an index used to quantitatively assess the synthetic accessibility of a compound, where a lower score indicates a greater ease of synthesis.
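Both scores can be computed with the scorers shipped in RDKit's Contrib directory, as in the minimal sketch below (our own usage, not necessarily the paper's exact scripts).

```python
# Sketch: NP Score and SA Score via the RDKit Contrib scorers.
import os
import sys
from rdkit import Chem, RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
sys.path.append(os.path.join(RDConfig.RDContribDir, "NP_Score"))
import sascorer   # synthetic accessibility, roughly 1 (easy) to 10 (hard)
import npscorer   # natural product-likeness, roughly -5 to 5

np_model = npscorer.readNPModel()

def np_and_sa_scores(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return npscorer.scoreMol(mol, np_model), sascorer.calculateScore(mol)

print(np_and_sa_scores("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin, illustrative input
```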

Through fine-tuning, the distributions of both the NP Score and the SA Score for smiles-gpt moved closer to those of COCONUT, whereas ChemGPT continues to generate compounds with a significantly different distribution from COCONUT even after fine-tuning. Furthermore, in comparison to the previous research, the fine-tuned smiles-gpt generates compounds that are closer to those in COCONUT, particularly in terms of SA Score.

From the above results, it is evident that the fine-tuned smiles-gpt can generate compounds that are more reminiscent of natural products than the fine-tuned ChemGPT can. Although it is difficult to make a definitive statement because of differences in training conditions and model specifics, the distinction between SMILES and SELFIES likely plays a significant role. While it is advantageous that SELFIES are 100% valid, they are a more verbose and arguably less intuitive molecular representation than SMILES.

Comparative studies between SMILES and SELFIES have reported that SMILES-trained models exhibit better performance [31, 32]. Although the lower validity of SMILES has been a concern, current language models have become sufficiently adept at learning the syntax of SMILES. Gao et al. [31] have pointed out that the advantage of SELFIES being 100% valid is decreasing.

Fig. 5 Kernel density estimation of NP Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al. [12]) and natural products from COCONUT

Fig. 6 Kernel density estimation of SA Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al. [12]) and natural products from COCONUT

3.4 Evaluation of bioactivity potential by protein–ligand docking

The utility of the generated compound library as a source of potential drug candidates was evaluated through protein–ligand docking calculations. Based on the predicted protein–ligand interactions, we assessed the viability of these compounds for pharmaceutical use.

For the target protein, the epidermal growth factor receptor (EGFR) was selected. Inhibition of EGFR has been reported to significantly suppress cancer cell proliferation [33], and several EGFR inhibitors have been developed as pharmaceuticals; gefitinib and erlotinib are well-known examples. In this experiment, the crystal structure of EGFR with PDB ID 2ITY [34], the complex of EGFR with gefitinib, was used.

Initially, 1,000 molecules were randomly selected as ligands from those generated by the fine-tuned smiles-gpt. As indicated by the results above, the fine-tuned ChemGPT was unable to generate natural product-like compounds, so molecules generated by ChemGPT were not used for docking. The ligands were then prepared using Schrödinger LigPrep [35], and the resulting 12,930 conformers were docked using Schrödinger Glide version 2020-2 [36].

The distribution of GlideScores for each conformation obtained from the docking is shown in Fig. 7. The GlideScore represents the predicted binding free energy between a protein and a ligand, with lower values indicating stronger binding. The GlideScore for gefitinib is \(-7.02\) kcal/mol [37]; 1,216 conformations achieved a better score than gefitinib, accounting for 9.8% of all docked conformations. Among these, the lowest (best) GlideScore was \(-11.51\) kcal/mol. This indicates that a significant number of compounds with docking scores better than those of an existing inhibitor were generated.
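For reference, the fraction of docked conformations scoring better than a given inhibitor can be obtained directly from the score list, as in the short sketch below (our own illustration; the score values shown are placeholders, not Glide output from the study).

```python
# Sketch: count conformations with a GlideScore better (lower) than a reference.
def fraction_better_than(scores, reference_score):
    better = [s for s in scores if s < reference_score]
    return len(better), len(better) / len(scores)

# Placeholder scores; in practice these would be read from the Glide output files.
n_better, frac = fraction_better_than([-8.1, -6.5, -7.4, -5.9], reference_score=-7.02)
print(n_better, frac)
```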

Table 3 presents the ten compounds with the best GlideScores among the 1,000 compounds subjected to docking, together with the natural product from the COCONUT database that is most similar to each compound. Although compounds with substructures similar to those of gefitinib were generated, most have relatively complex structures. Based on their NP Scores, most of these ten compounds are likely to be natural product-like (the NP Score ranges from \(-5\) to 5, with higher values indicating greater natural product-likeness). Their SA Scores are relatively high, considering that many compounds in small-molecule databases such as ChEMBL [38] have values of 2–3, suggesting that they are difficult to synthesize. Furthermore, inspection of the most similar natural products suggests that the model has learned to construct natural product scaffolds. Figure 8 shows the docking pose of the compound with the best GlideScore; the compound fits well into the EGFR binding pocket.

Fig. 7 Distribution of GlideScore for 12,930 docked conformations of 1,000 molecules generated by fine-tuned smiles-gpt

Fig. 8 Docking pose with EGFR of the compound with the best GlideScore

Furthermore, to examine whether natural product-likeness influences potential drug-likeness, we calculated the similarity between the natural products and the compounds subjected to docking and investigated its correlation with the docking score. The Tanimoto coefficient computed on Morgan fingerprints (radius 2, 2048 bits) was used as the similarity measure. For the 1,000 compounds selected for docking, we calculated the mean similarity to all compounds in the COCONUT database and plotted its relationship with the GlideScore in Fig. 9; for compounds with multiple stereoisomers, the one with the minimum (best) GlideScore was chosen. Note that the overall similarity values are low because they are means over the whole database. The Pearson correlation coefficient between mean similarity to natural products and GlideScore was \(r=-0.313\), indicating a weak correlation. Although the spread in the similarity of the generated compounds is slight (0.02–0.14), it can be inferred that compounds with a certain degree of natural product-likeness tend to have better docking scores.
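The sketch below illustrates this similarity analysis (our own reconstruction; the fingerprint settings match those stated above, while the data-handling details are assumptions).

```python
# Sketch: mean Tanimoto similarity to COCONUT vs. GlideScore.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def mean_similarity(query_smiles, reference_fps):
    fp = fingerprint(query_smiles)
    if fp is None:
        return None
    return float(np.mean(DataStructs.BulkTanimotoSimilarity(fp, reference_fps)))

def similarity_score_correlation(docked, reference_smiles):
    """docked: list of (smiles, glide_score) pairs; reference_smiles: e.g. COCONUT SMILES."""
    ref_fps = [fp for fp in map(fingerprint, reference_smiles) if fp is not None]
    pairs = [(mean_similarity(s, ref_fps), g) for s, g in docked]
    sims, scores = zip(*[(sim, g) for sim, g in pairs if sim is not None])
    return np.corrcoef(np.array(sims), np.array(scores))[0, 1]   # Pearson r
```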

However, it should be noted that there is a tendency for docking scores to improve as the molecular weight increases [39, 40]. Figure 10 shows the relationship between molecular weight and GlideScore for the 1000 compounds. There is a weak correlation between molecular weight and GlideScore (\(r=-0.358\)), suggesting that larger molecules tend to have better docking scores. Therefore, the improvement in docking scores cannot be solely attributed to the resemblance to natural products.

Fig. 9 Relationship between the mean similarity with all compounds in COCONUT and GlideScore for 1,000 molecules generated by fine-tuned smiles-gpt

Fig. 10 Relationship between molecular weight and GlideScore for 1,000 molecules generated by fine-tuned smiles-gpt

Table 3 Structures, natural product-likeness scores and synthetic accessibility scores of the top ten compounds out of 1000 compounds, based on GlideScore in the docking experiment, and the natural products with the highest similarity to those compounds

4 Conclusion

In this research, we fine-tuned chemical language models, pretrained on general chemical data, on a natural product dataset to generate natural product-like compounds. We measured various metrics for the molecules generated by the fine-tuned models and demonstrated that they are closer to the distribution of natural products than those of the original pretrained models.

In the docking experiments with EGFR, we found that the molecules generated by the fine-tuned smiles-gpt model included viable drug candidates. This illustrates the effectiveness of the language model developed in this research in creating a collection of potential pharmaceutical candidate compounds.

Compared to the previous research by Tay et al. [12], we have been able to create a model that generates compounds closer to natural products. Furthermore, this study demonstrates the relationship between similarity to natural products and potential utility as drug candidates, which distinguishes it from the previous study. As future work, methodologies are needed to extract knowledge about functional substructures, such as those underlying the potential bioactivity of natural products. Visualization studies focusing on substructures [41, 42] may prove to be a valuable tool in this area.