The prediction of molecular toxicity based on BiGRU and GraphSAGE
Introduction
The prediction of molecular properties is one of the most critical tasks in drug discovery. Accurately predicting the properties of drug molecules enables rapid screening of drug candidates, saving considerable time and money. During the drug candidate screening stage, pharmacokinetic properties (ADMET) receive wide attention [1]. ADMET is a comprehensive study of five properties of a drug: absorption, distribution, metabolism, excretion and toxicity [2]. Evaluating ADMET properties in the early stage of drug discovery can effectively address the problem of species differences, significantly improve the success rate of drug discovery, and reduce its cost. It takes more than 10 years and $200 million to bring an FDA-approved drug to market [3,4]. Drug safety is a main reason for such high costs, accounting for 96% of drug failures [5]. Drug toxicity and side effects are a major practical problem in the later stages of drug discovery [6–9]. Therefore, predicting the toxicity of drug molecules is of great significance in drug discovery, and it should be carried out as early as possible to avoid high costs.
Traditionally, drug toxicity is studied in the laboratory through in vivo or in vitro biological experiments [10]. Although these experiments are very reliable, they are inefficient and expensive, and some of them use animals, which raises ethical problems. Accordingly, the Quantitative Structure-Property/Activity Relationship (QSPR/QSAR) [11,12] method has gradually replaced biological experiments in the field of drug toxicity research. QSPR/QSAR mainly uses statistical methods and molecular structure parameters to study the relationship between the structure of compounds and their physicochemical properties and biological activities. In recent years, the Heuristic Method (HM) [13], Multivariate Linear Regression (MLR) [14], Artificial Neural Networks (ANN) [15], Support Vector Machines (SVM) [16], Projection Pursuit Regression (PPR) [17] and other methods have been used to build QSPR/QSAR models. QSPR/QSAR models rely heavily on molecular characterization for molecular property prediction, so molecular representations are widely used in molecular toxicity prediction [18].
Traditional molecular characterization methods rely on experts to handcraft a set of rules that encode relevant structural information or physicochemical properties of molecules into fixed-length vectors. Molecular fingerprints [19,20] and molecular descriptors [21] are two typical expressions of molecular features. Among these, the molecular fingerprint is an abstract expression of a molecule that converts the molecule into a bit vector, which provides some help in predicting molecular properties. However, because the encoding itself is sparse, it is difficult to obtain molecule-specific features when predicting molecular toxicity. Molecular descriptors are obtained by researchers through professional observation or manual extraction. A descriptor measures one aspect of a molecule; it can be either a physicochemical property of the molecule or a numerical index derived from the molecular structure by an algorithm. Molecular descriptors can, to a certain extent, filter out information irrelevant to property prediction, but because they are acquired manually, they are prone to bias. In general, molecular fingerprints and molecular descriptors tend to introduce avoidable errors into molecular toxicity prediction.
Deep learning has developed rapidly and is now applied extensively, not only in fields such as natural language processing, computer vision, and artificial intelligence, but also in many other fields [22–26]. With its continuous development, molecular representations that differ from molecular fingerprints and molecular descriptors have also appeared. For example, the one-dimensional, sequence-based Simplified Molecular Input Line Entry System (SMILES) representation [27] and the two-dimensional molecular graph representation [28,29] are widely used. Using deep learning with these different representations to predict the properties or the toxicity of molecules is an important problem, and several researchers have addressed it.
The BiGRU neural network has often been used for text sentiment classification [30–32]. Lin et al. [33] regarded the SMILES form of a molecule as a sentence in a text and used a BiGRU neural network in a novel molecular representation learning framework to predict the properties of molecules. Peng et al. [34] leveraged, from another perspective, the context structure of SMILES strings and the biochemical properties of the molecules themselves, and used deep learning to predict the toxicity of drug molecules. Zhang et al. [35] proposed a self-supervised learning method for molecular property prediction using the local information transfer mechanism of graph neural networks. Zhang et al. [36] used the graph neural network GraphSAGE to study drugs in drug repositioning, which suggested a new graph-neural-network approach to molecular property prediction. Guo et al. [37] also provided a new method of molecular characterization by recombining and fusing SMILES strings and molecular graph structures.
Motivated by the results above, this article combines the benefits of these techniques into a new method for predicting molecular toxicity in drug development. The method uses both the molecular graph representation and the SMILES representation, so it can exploit the context information of the SMILES string as well as the structural information of the molecular graph. On the SMILES side, the SMILES string of the molecule is first one-hot [38] encoded, and the one-hot encoding is then transformed into an embedding matrix. The resulting embedding vectors are used as the input of the BiGRU, and the context feature vector bn of the SMILES string is finally obtained through a pooling layer. On the molecular graph side, the graph neural network GraphSAGE [39] is used: the neighbor vertices of each vertex are sampled, and the information contained in the sampled neighbors is then aggregated by an aggregation function to obtain the molecular graph structure feature vector dg. Finally, the global feature vector y of the molecule is obtained through a fusion layer and used for drug molecule toxicity prediction. The main contributions of this article are as follows.
- (1) We propose a molecular toxicity prediction model named MTBG that uses molecular SMILES strings and molecular graph structures. BiGRU and GraphSAGE are used to extract information from the SMILES strings and the molecular graphs, respectively, and the fusion layer integrates the molecular feature information to predict the toxicity of drug molecules.
- (2) Extensive experiments are conducted on the Tox21 dataset to demonstrate the performance of the MTBG model. The MTBG model achieved superior performance compared with widely used deep learning models.
The rest of this article is organized as follows: Section 2 presents molecular representation. Section 3 details our method. Section 4 presents the experimental results and analysis. Section 5 concludes the article.
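The graph pathway described above samples the neighbors of each vertex and then aggregates their information. The following is a minimal plain-Python sketch of that idea, not the authors' implementation. The mean aggregator, the ReLU activation, the hand-picked weight matrix (learned in a real model), the mean-pooling readout, and all function names are assumptions made for illustration; the text above does not specify them.

```python
import random


def sample_neighbors(adjacency, node, k, rng):
    """Return up to k neighbors of `node`, sampled without replacement."""
    neighbors = adjacency[node]
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sage_layer(features, adjacency, weight, k, rng):
    """One GraphSAGE-style layer with a mean aggregator (assumed choice).

    For each vertex: sample neighbors, average their feature vectors,
    concatenate the result with the vertex's own features, then apply
    a linear map followed by ReLU.
    """
    updated = {}
    for node, own in features.items():
        sampled = sample_neighbors(adjacency, node, k, rng)
        if sampled:
            pooled = mean_vectors([features[u] for u in sampled])
        else:
            pooled = [0.0] * len(own)  # isolated vertex: nothing to aggregate
        combined = own + pooled  # list concatenation: [own | neighborhood]
        updated[node] = [
            max(0.0, sum(w * x for w, x in zip(row, combined)))
            for row in weight
        ]
    return updated


def readout(features):
    """Mean-pool all vertex vectors into one graph-level vector."""
    return mean_vectors(list(features.values()))


# Ethanol (SMILES "CCO") as a heavy-atom graph: C0 - C1 - O2.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # [is_carbon, is_oxygen]
# Fixed illustrative weights (2 outputs x 4 inputs); a real model learns these.
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]
rng = random.Random(0)

hidden = sage_layer(features, adjacency, weight, k=2, rng=rng)
dg = readout(hidden)  # plays the role of the graph feature vector dg
print(dg)  # [1.5, 0.5]
```

With `k=2` every vertex in this small graph keeps all of its neighbors, so the result is deterministic; a smaller `k` would make the sampling step visible on the middle carbon.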
Molecular representation
In recent years, the SMILES form of molecules has been widely used to predict molecular properties [40]. SMILES strings are commonly used to represent and store molecular data, taking the form of a single line of text composed of symbols for atoms and bonds. SMILES is a one-dimensional sequence representation made up of ASCII characters such as letters and numbers. Table 1 shows some simple chemical molecules and their SMILES string representations. Compared with other
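To illustrate how a SMILES string can be turned into the one-hot input mentioned in the Introduction, here is a small plain-Python sketch. It is only an illustration under stated assumptions: tokenization is done per character (so multi-character atoms such as `Cl` or `Br` would be split, which a real tokenizer should avoid), the vocabulary is built from the three example strings, and the function names `build_vocab` and `one_hot_encode` are hypothetical.

```python
def build_vocab(smiles_list):
    """Map each distinct character in the SMILES strings to an index (sorted order)."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: index for index, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x len(vocab) matrix of 0/1 values.

    Row i has a single 1 in the column of the i-th character.
    Rows beyond the end of the string are left as all-zero padding.
    """
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for position, ch in enumerate(smiles[:max_len]):
        matrix[position][vocab[ch]] = 1
    return matrix


smiles_examples = ["CCO", "C=O", "c1ccccc1"]  # ethanol, formaldehyde, benzene
vocab = build_vocab(smiles_examples)
encoded = one_hot_encode("CCO", vocab, max_len=8)

print(vocab)        # {'1': 0, '=': 1, 'C': 2, 'O': 3, 'c': 4}
print(encoded[:3])  # [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
```

Padding every string to the same `max_len` gives fixed-size inputs, which is convenient when a batch of molecules is later passed through an embedding layer and a recurrent network.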
Methods
The MTBG model mainly uses the context feature information of the SMILES string of the molecule and the feature information of the molecular graph structure. The overall framework of the MTBG model is depicted in Fig. 2. The model is composed of three parts: extraction of context feature information from SMILES, shown in Fig. 2(a) and (b); extraction of molecular graph structure feature information from the graph, shown in Fig. 2(c) and (d); and the prediction part, which includes Fig.
Experimental results and analysis
In this section, we describe the datasets, preprocessing, baselines and evaluation metrics, and present the experimental results.
Conclusion and discussion
In this article, we propose an "end-to-end" model named MTBG to predict the toxic properties of molecules. The model captures not only the contextual feature information of molecular SMILES strings, but also the structural feature information of molecular graphs. We performed toxicity prediction for 12 tasks on the Tox21 dataset, and our model outperformed the seven baseline models.
In this article, the idea of dual pathway is proposed in the prediction
Funding
This work was supported by the National Natural Science Foundation of China (62272288, 61972451, 61902230, U22A2041) and the Shenzhen Science and Technology Program (No. KQTD20200820113106007).
Data availability statement
The Tox21 data and the main code are available at https://github.com/jpliuhaha/jpliuhaha.git.
Declaration of competing interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “The prediction of molecular toxicity based on BiGRU and GraphSAGE”.
References (60)
- et al. Pharmacophore screening, molecular docking, ADMET prediction and MD simulations for identification of ALK and MEK potential dual inhibitors. J. Mol. Struct. (2021)
- et al. Drug repositioning: progress and challenges in drug discovery for various diseases. Eur. J. Med. Chem. (2022)
- et al. In silico methods and tools for drug discovery. Comput. Biol. Med. (2021)
- et al. ADMET tools: prediction and assessment of chemical ADMET properties of NCEs
- et al. Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing (2016)
- et al. A unified frame of predicting side effects of drugs by using linear neighborhood similarity. BMC Syst. Biol. (2017)
- et al. Feature-derived graph regularized matrix factorization for predicting drug side effects. Neurocomputing (2018)
- et al. Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS. Comput. Struct. Biotechnol. J. (2013)
- et al. Developing a support vector machine based QSPR model for prediction of half-life of some herbicides. Ecotoxicol. Environ. Saf. (2016)
- et al. Introducing block design in graph neural networks for molecular properties prediction. Chem. Eng. J. (2021)
- A compendium of fingerprint-based ADMET prediction models. J. Cheminf.
- The stages of drug discovery and development process. Asian J. Pharmaceut. Res. Dev.
- In vitro and in vivo toxicity assessment of nanoparticles. Int. Nano Lett.
- QSAR/QSPR modeling: introduction
- QSPR/QSAR: state-of-art, weirdness, the future. Molecules
- QSPR study of fluorescence wavelengths (λex/λem) based on the heuristic method and radial basis function neural networks. QSAR Comb. Sci.
- QSAR/QSPR as an application of artificial neural networks
- QSPR study on the melting points of a diverse set of potential ionic liquids by projection pursuit regression. QSAR Comb. Sci.
- Comparing predictive ability of QSAR/QSPR models using 2D and 3D molecular representations. J. Comput. Aided Mol. Des.
- Extended-connectivity fingerprints. J. Chem. Inf. Model.
- Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. Idrugs
- Molecular descriptors
- Deep learning vs. traditional computer vision
- Deep learning for predicting toxicity of chemicals: a mini review. J. Environ. Sci. Health, Part C
- Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. Sn Comput. Sci.
- Continuous training and deployment of deep learning models. Datenbank Spektrum
- Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol. Divers.
- Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules. Briefings Bioinf.
- Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminf.
- Compressed graph representation for scalable molecular graph generation. J. Cheminf.