1 Introduction

Natural products derived from plants and microorganisms have attracted attention for their beneficial properties and diverse biological activities [1, 2]. These compounds are known for their complex structures and large molecular weights. Because they are biosynthesized within living organisms and the structures have been evolutionarily optimized over millions of years to perform specific biological functions, many of them display potent biological activities and are often used as lead compounds in drug development. Between 1981 and 2002, natural products accounted for over 60% and 75% of the new chemical entities (NCEs) developed for cancer and infectious diseases, respectively [3]. In addition, approximately half of the drugs currently available on the market are derived from natural products [4], highlighting their vital role in drug discovery and development. Although the proportion of natural products in novel drugs is decreasing in the pharmaceutical industry today, the influence of the structure of natural products still cannot be ignored [5].

The unique molecular structures of natural products, which are rarely found in synthetic compounds, contribute to their biological activity [1]. The golden age of natural product drug discovery began in the 1940s with the discovery of penicillin. Many drugs were discovered from microbes, especially actinomycetes and fungi, until the early 1970s. However, from the late 1980s to the early 1990s, new drug discoveries from natural products declined [6]. Pharmaceutical companies began to withdraw from natural product research due to the emergence of combinatorial chemistry and high-throughput screening (HTS), which allowed artificial creation of chemical diversity. Furthermore, the complexity of the structures of natural products made synthesis and derivatization difficult, complicating lead compound optimization [7, 8]. Despite these challenges, natural products have recently been reassessed and are once again gaining attention as valuable resources in drug discovery due to their diverse structures and biological activities [6].

In recent years, advances in deep learning-based molecular generation have been applied to the discovery of novel pharmaceuticals [9]. This approach involves the virtual generation of compounds on computers, with the aim of identifying useful candidate molecules. However, because the training process typically utilizes general chemical databases comprising relatively small molecules, such as PubChem [10], it is challenging to generate large and complex compounds similar to natural products. Consequently, this limitation narrows the chemical space that can be explored [11].

In this study, we propose a molecular generative model capable of producing natural product-like compounds. By generating a group of molecules using a model that has learned the distribution of natural products, we aim to facilitate the search for lead molecules in drug discovery and reduce the costs of natural product-based drug development.

A previous study closely related to this work is that of Tay et al., who used a recurrent neural network (RNN) to generate natural products [12]. They trained an RNN equipped with LSTM units on natural products from the COCONUT database [13] and developed a model capable of generating compounds similar to natural products. They showed that the distribution of natural product-likeness scores of the generated compounds was similar to that of the natural products in COCONUT. This study aims to build a higher-performing model based on the Transformer architecture than the approach of Tay et al., and further evaluates whether the generated library is useful as a source of pharmaceutical candidates. Figure 1 summarizes the method of this paper.

Fig. 1 Overview of the proposed method

2 Methods

2.1 Fine-tuning and chemical language models

Fine-tuning is a technique that adapts language models, initially trained on extensive datasets, to excel at particular tasks, tailoring them to specialized requirements. In this study, we fine-tuned chemical language models on a natural product dataset. A chemical language model is a model that processes string representations of molecules, e.g., the simplified molecular-input line-entry system (SMILES) [14] and self-referencing embedded strings (SELFIES) [15]. Examples of these string representations are shown in Fig. 2. We hypothesized that, because pretrained models have already learned chemical structures, we could efficiently construct a model capable of generating natural product-like compounds.
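As a concrete illustration of the two representations, the following minimal sketch (not part of the original study; the example molecule and package calls are our own assumptions) converts a molecule between SMILES and SELFIES using RDKit and the selfies package.

```python
# Minimal sketch: SMILES vs. SELFIES for the same molecule.
# Assumes RDKit and the `selfies` package are installed; caffeine is used
# here only as an illustrative molecule (the paper's Fig. 2 uses cochliodinol).
from rdkit import Chem
import selfies as sf

smiles = "Cn1cnc2c1c(=O)n(C)c(=O)n2C"          # caffeine

mol = Chem.MolFromSmiles(smiles)               # parse and validate
canonical_smiles = Chem.MolToSmiles(mol)       # canonical SMILES via RDKit

selfies_string = sf.encoder(canonical_smiles)  # SMILES -> SELFIES
roundtrip_smiles = sf.decoder(selfies_string)  # SELFIES -> SMILES (always valid)

print(canonical_smiles)
print(selfies_string)
print(roundtrip_smiles)
```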

Fig. 2 Examples of SMILES and SELFIES encoding (cochliodinol, a natural product compound)

2.2 Dataset

We used the COCONUT database, which contains approximately 400,000 natural products [13]. As a preprocessing step, we standardized the SMILES strings and removed large compounds (more than 150 atoms or more than 10 rings). Our preprocessing largely follows the data preparation procedure of ReinventCommunity [16]. MolVS [17] was used for SMILES standardization, which includes removal of explicit hydrogens and normalization of functional groups; this produced a consistent SMILES dataset and was expected to increase the validity of the generated molecules. In the filtering step, ReinventCommunity removes compounds with more than 70 atoms, but we raised this threshold to 150 because we are dealing with natural products of relatively high molecular weight. Subsequently, we employed a technique that enumerates SMILES by randomizing the traversal order of the molecular graph [18], augmenting the data approximately ninefold. The final dataset contained approximately 3.6 million entries and was used for fine-tuning. Some compounds in COCONUT appear to be nonnatural, but their proportion is small, so we believe they have a negligible impact on fine-tuning the pretrained models.
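A minimal sketch of this preprocessing is shown below; the exact MolVS and RDKit calls are our own assumptions, since the authors' actual script follows ReinventCommunity [16] and may differ in detail.

```python
# Sketch of the preprocessing pipeline: MolVS standardization, size/ring
# filtering, and randomized-SMILES augmentation with RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from molvs import Standardizer

standardizer = Standardizer()

def preprocess(smiles, max_atoms=150, max_rings=10, n_random=9):
    """Return a list of SMILES (canonical + randomized) or [] if filtered out."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    mol = standardizer.standardize(mol)                       # normalize functional groups etc.
    if (mol.GetNumAtoms() > max_atoms
            or rdMolDescriptors.CalcNumRings(mol) > max_rings):
        return []                                             # drop very large / highly cyclic molecules
    enumerated = [Chem.MolToSmiles(mol)]                      # canonical form
    enumerated += [Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                   for _ in range(n_random)]                  # ~ninefold augmentation
    return enumerated

print(preprocess("CC(=O)Oc1ccccc1C(=O)O"))                    # aspirin, illustrative input
```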

2.3 Models

We selected pretrained models that satisfy the following criteria:

  • It has been trained on a dataset of significant size.

  • It is a decoder-only model, i.e., it uses only the decoder of the Transformer architecture. The generative pretrained transformer (GPT) [19] is an example of such a model, and GPT-based models are widely used for generative tasks due to their high performance.

We selected two models, smiles-gpt [20] and ChemGPT [21]. The details of the models are shown in Table 1. Both models were pretrained on the PubChem-10 M dataset [22] (smiles-gpt used the first 5 million molecules of PubChem-10 M), and their architectures are based on GPT. The models differ mainly in the molecular string representation used: smiles-gpt employs SMILES, whereas ChemGPT uses SELFIES. We used both models to investigate which model characteristics are suitable for our method.

Table 1 Pretrained models used in this study

2.4 Training

We fine-tuned the models on the natural product dataset using the AdamW optimizer [25]. The learning rate was decayed from \(5.0 \times 10^{-4}\) to \(5.0 \times 10^{-8}\) with a cosine annealing schedule for smiles-gpt and was set to \(5.0 \times 10^{-5}\) for ChemGPT. The batch size was set to 256 for smiles-gpt and 32 for ChemGPT; because SELFIES strings are long and occupy more memory, the ChemGPT batch size had to be reduced to fit within GPU memory. Training was conducted on four GeForce RTX 3090 GPUs.
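The sketch below illustrates how such fine-tuning could be set up with Hugging Face Transformers; the checkpoint identifier, toy dataset, and epoch count are placeholders rather than the authors' exact configuration (the Trainer uses AdamW by default, and a cosine learning-rate schedule is requested explicitly).

```python
# Minimal fine-tuning sketch (assumptions: placeholder checkpoint path, toy data).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "path/to/pretrained-chemical-gpt"   # placeholder for smiles-gpt / ChemGPT weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token    # GPT-style tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Toy molecules standing in for the ~3.6 million preprocessed COCONUT strings.
np_strings = ["CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1"]
train_dataset = Dataset.from_dict({"text": np_strings}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="np-finetuned",
    per_device_train_batch_size=64,   # 4 GPUs x 64 = 256 (smiles-gpt); 32 in total for ChemGPT
    learning_rate=5e-4,               # initial learning rate for smiles-gpt
    lr_scheduler_type="cosine",       # cosine annealing schedule
    num_train_epochs=1,               # illustrative value
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()

# After training, molecules can be sampled autoregressively, e.g.:
# outputs = model.generate(do_sample=True, max_length=256, num_return_sequences=8,
#                          pad_token_id=tokenizer.pad_token_id)
# smiles = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```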

3 Results and discussion

3.1 Evaluation of generated molecules

We calculated validity, uniqueness, novelty, internal diversity [26], and Fréchet ChemNet Distance (FCD) [27] for the 100 million molecules generated and made public in a previous study [12], as well as for the 100 million molecules generated by fine-tuned ChemGPT and smiles-gpt.

  • Validity: The ratio of valid molecules to the total number of generated molecules. A valid molecule is one that can be parsed by RDKit [28].

  • Uniqueness: The ratio of unique molecules to the total number of generated molecules.

  • Novelty: The ratio of generated molecules that do not exist in the COCONUT database.

  • Internal diversity: The average pairwise Tanimoto similarity between the generated molecules, calculated using Morgan fingerprints with a radius of 2 and 1024 bits. This metric was calculated using MOSES [26].

  • FCD: A metric of the distance between the distribution of generated molecules and that of the training dataset. A smaller FCD indicates that the set of generated molecules is closer to the training data distribution.
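A minimal sketch of how the first three metrics can be computed with RDKit is given below (our own code, not the paper's evaluation script; internal diversity and FCD were obtained with the MOSES [26] and FCD [27] packages, and the denominators here are illustrative choices following the definitions above).

```python
# Sketch: validity, uniqueness, and novelty of a generated SMILES set.
from rdkit import Chem

def basic_metrics(generated_smiles, reference_smiles):
    # Validity: fraction of generated strings that RDKit can parse.
    mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]   # canonical SMILES
    validity = len(valid) / len(generated_smiles)

    # Uniqueness: fraction of distinct canonical SMILES among the generated set.
    unique = set(valid)
    uniqueness = len(unique) / len(generated_smiles)

    # Novelty: fraction of generated molecules not present in the reference (COCONUT).
    reference = {Chem.MolToSmiles(m) for m in map(Chem.MolFromSmiles, reference_smiles)
                 if m is not None}
    novelty = len(unique - reference) / len(generated_smiles)
    return validity, uniqueness, novelty

print(basic_metrics(["CCO", "CCO", "not_a_smiles"], ["CCO"]))   # toy example
```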

The results are shown in Table 2. Smiles-gpt achieved results close to those of the previous study [12]. Compared to Tay et al., its smaller FCD suggests that more compounds similar to natural products were generated, i.e., that sampling occurred from a narrower chemical space better adapted to that of natural products. In this respect, it generated compounds more closely resembling natural products than the previous study did.

ChemGPT exhibited high validity, which is believed to be due to the use of SELFIES. However, the significantly large FCD indicates that the distribution of natural products was not captured accurately. Although high uniqueness and novelty are numerically positive outcomes, the magnitude of FCD suggests sampling from a broader chemical space, resulting in the generation of compounds that appear to be nearly random.

We also measured the FCD between the molecular sets generated by each model before and after fine-tuning: 29.01 for ChemGPT and 6.75 for smiles-gpt. This indicates that the distribution of molecules generated by ChemGPT changed more strongly through fine-tuning than that of smiles-gpt. Thus, although Table 2 suggests that ChemGPT did not learn the distribution of COCONUT, the distribution of its generated molecules did at least change markedly from that of the pretrained model.

Table 2 Validity, uniqueness, novelty, internal diversity, and FCD of each generated set of 100 million molecules

3.2 Visualization of generated molecules in physicochemical descriptor space

We visualized the distribution of molecules generated by the original and fine-tuned models, along with COCONUT compounds, using t-distributed stochastic neighbor embedding (t-SNE). We randomly selected 2,000 of the generated molecules and embedded them in two dimensions using t-SNE based on 209 physicochemical descriptors per molecule. The descriptors were calculated with Descriptors.CalcMolDescriptors from RDKit. The visualization results are shown in Figs. 3 and 4.
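The sketch below shows how such an embedding can be computed (our own code; the descriptor standardization step and t-SNE settings are assumptions, not the paper's exact procedure).

```python
# Sketch: RDKit descriptor vectors followed by a 2-D t-SNE embedding.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def descriptor_matrix(smiles_list):
    """One row of physicochemical descriptors per parseable molecule."""
    rows = []
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            continue
        desc = Descriptors.CalcMolDescriptors(mol)          # dict of ~200 RDKit descriptors
        rows.append([desc[name] for name in sorted(desc)])
    return np.nan_to_num(np.asarray(rows, dtype=float))

def tsne_embedding(smiles_list, random_state=0):
    """Two-dimensional t-SNE embedding of the descriptor space."""
    X = StandardScaler().fit_transform(descriptor_matrix(smiles_list))
    return TSNE(n_components=2, random_state=random_state).fit_transform(X)
```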

From the smiles-gpt results in Fig. 3, it appears that the overall distribution of the molecules has moved closer to COCONUT through fine-tuning. In contrast, as shown in Fig. 4, ChemGPT still exhibits a different distribution from COCONUT even after fine-tuning.

Fig. 3 t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of smiles-gpt, along with molecules from COCONUT

Fig. 4 t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of ChemGPT, along with molecules from COCONUT

3.3 Distribution of scores for generated molecules

We calculated the natural product-likeness score (NP Score) [29] and the synthetic accessibility score (SA Score) [30] for molecules generated by the original and fine-tuned models, as well as for molecules generated in the previous research by Tay et al., and compared their distributions with those of the natural product data. Kernel density estimation was performed on the NP Score and SA Score values of each molecular library, and the results are shown in Figs. 5 and 6.

The NP Score is an index that measures the natural product-likeness of a compound, calculated based on the frequency of occurrence of substructures in natural products. The SA Score is an index used to quantitatively assess the synthetic accessibility of a compound, where a lower score indicates a greater ease of synthesis.
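Both scores can be computed with the scorers shipped in RDKit's Contrib directory, as in the minimal sketch below (our own usage, not necessarily the paper's exact scripts).

```python
# Sketch: NP Score and SA Score via the RDKit Contrib scorers.
import os
import sys
from rdkit import Chem, RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
sys.path.append(os.path.join(RDConfig.RDContribDir, "NP_Score"))
import sascorer   # synthetic accessibility, roughly 1 (easy) to 10 (hard)
import npscorer   # natural product-likeness, roughly -5 to 5

np_model = npscorer.readNPModel()

def np_and_sa_scores(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return npscorer.scoreMol(mol, np_model), sascorer.calculateScore(mol)

print(np_and_sa_scores("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin, illustrative input
```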

Through fine-tuning, the distributions of both the NP Score and the SA Score for smiles-gpt moved closer to those of COCONUT, whereas ChemGPT continues to generate compounds with a significantly different distribution from COCONUT even after fine-tuning. Furthermore, in comparison to the previous research, the fine-tuned smiles-gpt generates compounds that are closer to those in COCONUT, particularly in terms of SA Score.

From the above results, it is evident that the fine-tuned smiles-gpt can generate compounds that are more reminiscent of natural products than the fine-tuned ChemGPT can. Although it is difficult to make a definitive statement because of differences in training conditions and model specifics, the distinction between SMILES and SELFIES likely plays a significant role. While it is advantageous that SELFIES are 100% valid, they are a more verbose and arguably less intuitive molecular representation than SMILES.

Comparative studies between SMILES and SELFIES have reported that SMILES-trained models exhibit better performance [31, 32]. Although the lower validity of SMILES has been a concern, current language models have become sufficiently adept at learning the syntax of SMILES. Gao et al. [31] have pointed out that the advantage of SELFIES being 100% valid is decreasing.

Fig. 5 Kernel density estimation of NP Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al. [12]) and natural products from COCONUT

Fig. 6 Kernel density estimation of SA Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al. [12]) and natural products from COCONUT

3.4 Evaluation of bioactivity potential by protein–ligand docking

The utility of the generated compound library as a source of potential drug candidates was evaluated through protein–ligand docking calculations. Based on the predicted protein–ligand interactions, we assessed the viability of these compounds for pharmaceutical use.

For the target protein, the epidermal growth factor receptor (EGFR) was selected. Inhibition of EGFR has been reported to significantly suppress cancer cell proliferation [33], and several EGFR inhibitors have been developed as pharmaceuticals; gefitinib and erlotinib are well-known examples. In this experiment, the crystal structure of EGFR with PDB ID 2ITY [34], the complex of EGFR with gefitinib, was used.

Initially, 1,000 molecules were randomly selected as ligands from those generated by the fine-tuned smiles-gpt. As indicated by the results above, the fine-tuned ChemGPT was unable to generate natural product-like compounds, so molecules generated by ChemGPT were not used for docking. The ligands were then prepared using Schrödinger LigPrep [35], and the resulting 12,930 conformers were docked using Schrödinger Glide version 2020-2 [36].

The distribution of GlideScores for each conformation obtained from the docking is shown in Fig. 7. The GlideScore represents the predicted binding free energy between a protein and a ligand, with lower values indicating stronger binding. The GlideScore for gefitinib is \(-7.02\) kcal/mol [37]; 1,216 conformations achieved a better score than gefitinib, accounting for 9.8% of all docked conformations. Among these, the lowest (best) GlideScore was \(-11.51\) kcal/mol. This indicates that a significant number of compounds with docking scores better than those of an existing inhibitor were generated.
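For reference, the fraction of docked conformations scoring better than a given inhibitor can be obtained directly from the score list, as in the short sketch below (our own illustration; the score values shown are placeholders, not Glide output from the study).

```python
# Sketch: count conformations with a GlideScore better (lower) than a reference.
def fraction_better_than(scores, reference_score):
    better = [s for s in scores if s < reference_score]
    return len(better), len(better) / len(scores)

# Placeholder scores; in practice these would be read from the Glide output files.
n_better, frac = fraction_better_than([-8.1, -6.5, -7.4, -5.9], reference_score=-7.02)
print(n_better, frac)
```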

Table 3 presents the ten compounds with the best GlideScores among the 1,000 compounds subjected to docking, together with the natural product from the COCONUT database that is most similar to each compound. Although compounds with substructures similar to those of gefitinib were generated, most have relatively complex structures. Based on their NP Scores, most of these ten compounds are likely to be natural product-like (the NP Score ranges from \(-5\) to 5, with higher values indicating greater natural product-likeness). Their SA Scores are relatively high, considering that many compounds in small-molecule databases such as ChEMBL [38] have values of 2–3, suggesting that they are difficult to synthesize. Furthermore, inspection of the most similar natural products suggests that the model has learned to construct natural product scaffolds. Figure 8 shows the docking pose of the compound with the best GlideScore; the compound fits well into the EGFR binding pocket.

Fig. 7 Distribution of GlideScore for 12,930 docked conformations of 1,000 molecules generated by fine-tuned smiles-gpt

Fig. 8 Docking pose with EGFR of the compound with the best GlideScore

Furthermore, to examine whether natural product-likeness influences potential drug-likeness, we calculated the similarity between the natural products and the compounds subjected to docking and investigated its correlation with the docking score. The Tanimoto coefficient computed on Morgan fingerprints (radius 2, 2048 bits) was used as the similarity measure. For the 1,000 compounds selected for docking, we calculated the mean similarity to all compounds in the COCONUT database and plotted its relationship with the GlideScore in Fig. 9; for compounds with multiple stereoisomers, the one with the minimum (best) GlideScore was chosen. Note that the overall similarity values are low because they are means over the whole database. The Pearson correlation coefficient between mean similarity to natural products and GlideScore was \(r=-0.313\), indicating a weak correlation. Although the spread in the similarity of the generated compounds is slight (0.02–0.14), it can be inferred that compounds with a certain degree of natural product-likeness tend to have better docking scores.
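The sketch below illustrates this similarity analysis (our own reconstruction; the fingerprint settings match those stated above, while the data-handling details are assumptions).

```python
# Sketch: mean Tanimoto similarity to COCONUT vs. GlideScore.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def mean_similarity(query_smiles, reference_fps):
    fp = fingerprint(query_smiles)
    if fp is None:
        return None
    return float(np.mean(DataStructs.BulkTanimotoSimilarity(fp, reference_fps)))

def similarity_score_correlation(docked, reference_smiles):
    """docked: list of (smiles, glide_score) pairs; reference_smiles: e.g. COCONUT SMILES."""
    ref_fps = [fp for fp in map(fingerprint, reference_smiles) if fp is not None]
    pairs = [(mean_similarity(s, ref_fps), g) for s, g in docked]
    sims, scores = zip(*[(sim, g) for sim, g in pairs if sim is not None])
    return np.corrcoef(np.array(sims), np.array(scores))[0, 1]   # Pearson r
```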

However, it should be noted that there is a tendency for docking scores to improve as the molecular weight increases [39, 40]. Figure 10 shows the relationship between molecular weight and GlideScore for the 1000 compounds. There is a weak correlation between molecular weight and GlideScore (\(r=-0.358\)), suggesting that larger molecules tend to have better docking scores. Therefore, the improvement in docking scores cannot be solely attributed to the resemblance to natural products.

Fig. 9 Relationship between the mean similarity with all compounds in COCONUT and GlideScore for 1,000 molecules generated by fine-tuned smiles-gpt

Fig. 10 Relationship between molecular weight and GlideScore for 1,000 molecules generated by fine-tuned smiles-gpt

Table 3 Structures, natural product-likeness scores and synthetic accessibility scores of the top ten compounds out of 1000 compounds, based on GlideScore in the docking experiment, and the natural products with the highest similarity to those compounds

4 Conclusion

In this research, we fine-tuned chemical language models, pretrained on general chemical data, on a natural product dataset to generate natural product-like compounds. We measured various metrics for the molecules generated by the fine-tuned models and demonstrated that they are closer to the distribution of natural products than those of the original pretrained models.

In the docking experiments with EGFR, we found that the molecules generated by the fine-tuned smiles-gpt model included viable drug candidates. This illustrates the effectiveness of the language model developed in this research in creating a collection of potential pharmaceutical candidate compounds.

Compared to the previous research by Tay et al. [12], we have been able to create a model that generates compounds closer to natural products. Furthermore, this study demonstrates the relationship between similarity to natural products and potential utility as drug candidates, which distinguishes it from the previous study. As future work, methodologies are needed to extract knowledge about functional substructures, such as those underlying the potential bioactivity of natural products. Visualization studies focusing on substructures [41, 42] may prove to be a valuable tool in this area.