The prediction of molecular toxicity based on BiGRU and GraphSAGE

https://doi.org/10.1016/j.compbiomed.2022.106524

Highlights

  • The article presents a novel approach, MTBG, for predicting molecular toxicity. Toxicity prediction plays an important role in drug discovery and can save considerable time and money.

  • The article uses a bidirectional gated recurrent unit (BiGRU) neural network and the graph neural network GraphSAGE to extract features from molecular SMILES strings and molecular graphs, respectively.

  • The experimental results show that the proposed approach achieves competitive performance in molecular toxicity prediction compared with classic methods.

Abstract

The prediction of molecular toxicity plays a crucial role in drug discovery, since it allows candidate drug molecules to be screened quickly. The conventional way to assess toxicity relies on in vivo or in vitro biological experiments in the laboratory, which can easily incur significant time and financial costs and even raise ethical issues. Therefore, using computational approaches to predict molecular toxicity has become a common strategy in modern drug discovery. In this article, we propose a novel model named MTBG, which uses both the SMILES (Simplified Molecular Input Line Entry System) strings and the graph structures of molecules to extract molecular features for drug toxicity prediction. To verify the performance of the MTBG model, we use the Tox21 dataset and several widely used baseline models. Experimental results demonstrate that our model outperforms these baseline models.

Introduction

The prediction of molecular properties is one of the most critical tasks in drug discovery. Accurately predicting the properties of drug molecules enables rapid screening of drug candidates, saving considerable time and money. During the drug candidate screening stage, pharmacokinetic properties (ADMET) receive wide attention [1]. ADMET refers to the comprehensive study of five drug properties: absorption, distribution, metabolism, excretion, and toxicity [2]. Evaluating ADMET properties in the early stage of drug discovery can effectively address the problem of species differences, significantly improve the success rate of drug discovery, and reduce its cost. It takes more than 10 years and $200 million to bring an FDA-approved drug to market [3,4]. Drug safety is a main reason for such high costs, accounting for 96% of drug failures [5]. Drug toxicity and side effects are a major practical problem in the later stages of drug discovery [6,7,8,9]. Therefore, predicting drug molecular toxicity is of great significance in drug discovery, and it should be carried out as early as possible to avoid high costs.

Traditionally, drug toxicity is studied in the laboratory through in vivo or in vitro biological experiments [10]. Although these experiments are very reliable, they are inefficient and expensive, and they sometimes involve animals, which raises ethical problems. Accordingly, the Quantitative Structure-Property/Activity Relationship (QSPR/QSAR) [11,12] method has gradually replaced biological experiments in drug toxicity research. QSPR/QSAR mainly uses statistical methods and molecular structure parameters to study the relationship between the structure of compounds and their physical and chemical properties and biological activities. In recent years, the Heuristic Method (HM) [13], Multivariate Linear Regression (MLR) [14], Artificial Neural Networks (ANN) [15], Support Vector Machines (SVM) [16], Projection Pursuit Regression (PPR) [17], and other methods have been used to build QSPR/QSAR models. QSPR/QSAR models rely heavily on molecular characterization for property prediction, so molecular representations are widely used in molecular toxicity prediction [18].

Traditional molecular characterization methods rely on experts to handcraft a set of rules that encode relevant structural information or physicochemical properties of molecules into fixed-length vectors. Molecular fingerprints [19,20] and molecular descriptors [21] are two typical representations of molecular features. A molecular fingerprint is an abstract representation that converts a molecule into a bit vector, which can help in predicting molecular properties. However, because the encoding is sparse, it is difficult to obtain molecule-specific features when predicting molecular toxicity. Molecular descriptors are obtained by researchers through expert observation or manual extraction. A descriptor measures one aspect of a molecule: it can be a physical or chemical property of the molecule, or a numerical index derived from the molecular structure by an algorithm. Molecular descriptors can filter out some information that is irrelevant to property prediction, but because they are acquired manually, they are prone to bias. In general, molecular fingerprints and molecular descriptors tend to introduce avoidable errors in molecular toxicity prediction.

Deep learning has developed rapidly and is now applied extensively, not only in fields such as natural language processing, computer vision, and artificial intelligence, but also in various other fields [22,23,24,25,26]. With the continuous development of deep learning, molecular representations that differ from molecular fingerprints and molecular descriptors have also appeared. For example, the one-dimensional, sequence-based Simplified Molecular Input Line Entry System (SMILES) representation [27] and the two-dimensional, molecular graph-based representation [28,29] are widely used. Using these different representations with deep learning to predict the properties or toxicity of molecules is an important problem, and several researchers have addressed it.

The BiGRU neural network has often been used for text sentiment classification [30,31,32]. Lin et al. [33] treated the SMILES form of a molecule as a sentence in a text and used a BiGRU neural network to build a novel molecular representation learning framework for predicting molecular properties. Peng et al. [34] took another perspective, leveraging the context structure of SMILES strings together with the biochemical properties of the molecules themselves, and used deep learning to predict the toxicity of drug molecules. Zhang et al. [35] proposed a self-supervised learning method for molecular property prediction that uses the local information transfer mechanism of graph neural networks. Zhang et al. [36] used the graph neural network GraphSAGE for drug repositioning research, suggesting a new way to apply graph neural networks to molecular property prediction. Guo et al. [37] also provided a new method for molecular characterization by fusing SMILES strings with molecular graph structures.

Motivated by the results listed above, this article combines the benefits of these techniques in a new method for predicting molecular toxicity in drug development. The method uses both the molecular graph representation and the SMILES representation, so it can exploit the context information of the molecular SMILES string as well as the structural information of the molecular graph. For the SMILES pathway, the SMILES string of the molecule is first one-hot [38] encoded. The one-hot encoding is then transformed into an embedding matrix. The resulting embedding vectors are used as the input of the BiGRU, and the context feature vector bn of the SMILES string is finally obtained through a pooling layer. For the graph pathway, the graph neural network GraphSAGE [39] is applied to the molecular graph structure. The neighbor vertices of each vertex are sampled, and the information they contain is aggregated by an aggregation function to obtain the molecular graph structure feature vector dg. Finally, the global feature vector y of the molecule is obtained through a fusion layer and used for drug molecule toxicity prediction. In this article, the main contributions are as follows.

  • (1)

    We propose a molecular toxicity prediction model named MTBG that uses molecular SMILES strings and molecular graph structures. BiGRU and GraphSAGE extract information from the SMILES strings and the molecular graphs, respectively, and a fusion layer integrates the molecular feature information to predict the toxicity of drug molecules.

  • (2)

    Extensive experiments are conducted on the Tox21 dataset to demonstrate the performance of the MTBG model. The MTBG model achieves superior performance compared to widely used deep learning models.
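
The SMILES pathway described above (embedded tokens, then a BiGRU, then pooling into the context vector bn) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, bias terms are omitted for brevity, max pooling is assumed because the text does not name the pooling type, and all names (`GRUCell`, `bigru_context_vector`, the dimensions) are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """One GRU cell with random, untrained weights (biases omitted)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random) -> None:
        def init(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        # Input weights (W*) and recurrent weights (U*) for the update gate z,
        # the reset gate r, and the candidate state.
        self.Wz, self.Uz = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)
        self.Wr, self.Ur = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)
        self.Wh, self.Uh = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)

    def step(self, x: Vector, h: Vector) -> Vector:
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        # Blend the previous state and the candidate state with the update gate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence: List[Vector]) -> List[Vector]:
        """Return the hidden state after each position of the sequence."""
        h = [0.0] * self.hidden_dim
        states = []
        for x in sequence:
            h = self.step(x, h)
            states.append(h)
        return states


def bigru_context_vector(sequence: List[Vector],
                         forward: GRUCell, backward: GRUCell) -> Vector:
    """Concatenate forward and backward states per token, then max-pool over tokens."""
    fwd = forward.run(sequence)
    bwd = backward.run(sequence[::-1])[::-1]  # re-align backward states with tokens
    per_token = [f + b for f, b in zip(fwd, bwd)]
    return [max(column) for column in zip(*per_token)]


rng = random.Random(0)
# Stand-in for an embedded SMILES string: 6 tokens, embedding size 4.
embedded = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
b_n = bigru_context_vector(embedded, GRUCell(4, 3, rng), GRUCell(4, 3, rng))
```

With a hidden size of 3 per direction, `b_n` has 6 components. In practice a deep learning framework would supply the trained GRU layer; the sketch only shows how the two directions and the pooling step fit together.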

The rest of this article is organized as follows: Section 2 presents molecular representation. Section 3 details our method. Section 4 presents and analyzes the experimental results. Section 5 concludes the article.

Molecular representation

In recent years, the SMILES form of molecules has been widely used to predict molecular properties [40]. SMILES strings are commonly used to represent and store molecular data, taking the form of single-line text composed of molecular symbols. SMILES is a one-dimensional sequence representation made up of ASCII characters such as letters and numbers. Table 1 shows some examples of simple chemical molecules and their SMILES string representations. Compared with other
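
As a small illustration of how a SMILES string can be turned into model input, the sketch below one-hot encodes SMILES strings character by character. It is an assumption-laden example rather than the authors' preprocessing: the helper names (`build_vocab`, `one_hot_encode`), the padding length, and the three example molecules are chosen here for illustration. Character-level splitting also breaks two-letter atoms such as `Cl` and `Br` into two tokens, which a real tokenizer would need to handle.

```python
from typing import Dict, List


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map every distinct character seen in the SMILES strings to an index."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Return a max_len x len(vocab) matrix; rows past the string end stay all-zero."""
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][vocab[ch]] = 1
    return matrix


# Ethanol, formaldehyde, and benzene written as SMILES.
smiles_examples = ["CCO", "C=O", "c1ccccc1"]
vocab = build_vocab(smiles_examples)
encoded = one_hot_encode("CCO", vocab, max_len=8)
```

Each row of `encoded` corresponds to one position in the string, and the all-zero rows act as padding so that every molecule yields a matrix of the same shape.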

Methods

The MTBG model mainly utilizes the context feature information of the molecule's SMILES string and the feature information of the molecular graph structure. The overall framework of the MTBG model is depicted in Fig. 2. The model is composed of three parts: extracting context feature information from SMILES (Fig. 2(a) and (b)), extracting molecular graph structure feature information from the graph (Fig. 2(c) and (d)), and the prediction part, which includes Fig.
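
The graph pathway and the fusion step can be sketched as follows. This is a minimal sketch under several assumptions, not the MTBG code: it uses a GraphSAGE-style mean aggregator with one layer, a hand-picked weight matrix instead of trained parameters, a mean readout over vertices to obtain dg, and plain concatenation as the fusion of bn and dg. All function names and the toy atom features are hypothetical.

```python
import random
from typing import Dict, List

Vector = List[float]


def sample_neighbors(adj: Dict[int, List[int]], v: int, k: int,
                     rng: random.Random) -> List[int]:
    """Sample at most k neighbours of vertex v without replacement."""
    neighbors = adj[v]
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sage_layer(features: Dict[int, Vector], adj: Dict[int, List[int]],
               weights: List[Vector], k: int, rng: random.Random) -> Dict[int, Vector]:
    """One layer: h_v' = ReLU(W * concat(h_v, mean of sampled neighbour features))."""
    updated = {}
    for v, h_v in features.items():
        sampled = sample_neighbors(adj, v, k, rng)
        agg = (mean_vectors([features[u] for u in sampled])
               if sampled else [0.0] * len(h_v))
        combined = h_v + agg  # list concatenation
        updated[v] = [max(0.0, sum(w * x for w, x in zip(row, combined)))
                      for row in weights]
    return updated


def graph_readout(features: Dict[int, Vector]) -> Vector:
    """Average the vertex vectors into one graph-level vector (assumed readout)."""
    return mean_vectors(list(features.values()))


def fuse(b_n: Vector, d_g: Vector) -> Vector:
    """Assumed fusion: concatenate the SMILES and graph feature vectors."""
    return b_n + d_g


# Ethanol (CCO) as a toy graph: vertices 0 and 1 are carbon, vertex 2 is oxygen.
adj = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # [is_carbon, is_oxygen]
weights = [[1.0, 0.0, 1.0, 0.0],  # hand-picked: output = own features + neighbour mean
           [0.0, 1.0, 0.0, 1.0]]
updated = sage_layer(features, adj, weights, k=2, rng=random.Random(0))
d_g = graph_readout(updated)
y = fuse([0.1, -0.2], d_g)  # [0.1, -0.2] is a placeholder for the SMILES vector bn
```

Here `k=2` covers every neighbour in this tiny molecule, so no sampling actually occurs; a smaller `k` or a larger graph would trigger the random subsampling branch. A prediction head on top of `y` is left out.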

Experimental results and analysis

In this section, we describe the datasets, preprocessing, baselines, and evaluation metrics, and present the experimental results.
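
This snippet does not say which evaluation metrics are used. As an assumption, the sketch below uses ROC-AUC averaged over tasks, a common choice for multi-task toxicity datasets such as Tox21, where some molecules have no label for some tasks. The function names and the toy labels and scores are hypothetical and are not results from the article.

```python
from typing import List, Optional


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_task_auc(label_rows: List[List[Optional[int]]],
                  score_rows: List[List[float]]) -> float:
    """Average ROC-AUC over tasks, skipping missing labels (None) within each task."""
    aucs = []
    for t in range(len(label_rows[0])):
        pairs = [(row[t], scores[t])
                 for row, scores in zip(label_rows, score_rows) if row[t] is not None]
        labels = [l for l, _ in pairs]
        if 0 in labels and 1 in labels:  # skip tasks where AUC is undefined
            aucs.append(roc_auc(labels, [s for _, s in pairs]))
    return sum(aucs) / len(aucs)


# Toy data: 4 molecules, 2 tasks; None marks a molecule without a label for a task.
toy_labels = [[1, 0], [0, None], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.5], [0.4, 0.7], [0.6, 0.1]]
toy_auc = mean_task_auc(toy_labels, toy_scores)  # 0.875 on this toy data
```

The pairwise formulation is quadratic in the number of molecules, which is fine for illustration; a library routine would be preferable on the full dataset.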

Conclusion and discussion

In this article, we propose an "end-to-end" model named MTBG to predict the toxic properties of molecules. The model captures not only the contextual feature information of molecular SMILES strings, but also the structural feature information of molecular graphs. We performed toxicity prediction for 12 tasks on the Tox21 dataset, and our model outperformed the seven baseline models.

In this article, the idea of dual pathway is proposed in the prediction

Funding

This work was supported by the National Natural Science Foundation of China (62272288, 61972451, 61902230, U22A2041) and the Shenzhen Science and Technology Program (No. KQTD20200820113106007).

Data availability statement

The Tox21 data and the main code are available at https://github.com/jpliuhaha/jpliuhaha.git.

Declaration of competing interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “The prediction of molecular toxicity based on BiGRU and GraphSAGE”.

References (60)

  • V. Venkatraman et al.

    A compendium of fingerprint-based ADMET prediction models

    J. Cheminf.

    (2021)
  • A.B. Deore et al.

    The stages of drug discovery and development process

    Asian J. Pharmaceut. Res. Dev.

    (2019)
  • V. Kumar et al.

    In vitro and in vivo toxicity assessment of nanoparticles

    Int. Nano Lett.

    (2017)
  • K. Roy et al.

    QSAR/QSPR modeling: introduction

  • A.A. Toropov et al.

    QSPR/QSAR: state-of-art, weirdness, the future

    Molecules

    (2020)
  • J. Shi et al.

    QSPR study of fluorescence wavelengths (λex/λem) based on the heuristic method and radial basis function neural networks

    QSAR Comb. Sci.

    (2006)
  • N. Montañez-Godínez et al.

    QSAR/QSPR as an application of artificial neural networks

  • Y. Ren et al.

    QSPR study on the melting points of a diverse set of potential ionic liquids by projection Pursuit regression

    QSAR Comb. Sci.

    (2009)
  • A. Sato et al.

    Comparing predictive ability of QSAR/QSPR models using 2D and 3D molecular representations

    J. Comput. Aided Mol. Des.

    (2021)
  • D. Rogers et al.

    Extended-connectivity fingerprints

    J. Chem. Inf. Model.

    (2010)
  • R.C. Glem et al.

    Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME

    Idrugs

    (2006)
  • V. Consonni et al.

    Molecular descriptors

  • N. O'Mahony et al.

    Deep learning vs. Traditional computer vision

  • W. Tang et al.

    Deep learning for predicting toxicity of chemicals: a mini review

    J. Environ. Sci. Health, Part C.

    (2018)
  • I.H. Sarker

    Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions

    Sn Comput. Sci.

    (2021)
  • I. Prapas et al.

    Continuous training and deployment of deep learning models

    Datenbank Spektrum

    (2021)
  • R. Gupta et al.

    Artificial intelligence to deep learning: machine intelligence approach for drug discovery

    Mol. Divers.

    (2021)
  • C.-K. Wu et al.

    Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules

    Briefings Bioinf.

    (2021)
  • D. Jiang et al.

    Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models

    J. Cheminf.

    (2021)
  • Y. Kwon et al.

    Compressed graph representation for scalable molecular graph generation

    J. Cheminf.

    (2020)