GSAML-DTA: An interpretable drug-target binding affinity prediction model based on graph neural networks with self-attention mechanism and mutual information

https://doi.org/10.1016/j.compbiomed.2022.106145

Highlights

  • We develop GSAML-DTA, an interpretable deep learning framework for DTA prediction.

  • GSAML-DTA integrates a self-attention mechanism and graph neural networks (GNNs) to build representations of drugs and target proteins from the structural information.

  • In addition, mutual information is introduced to filter out redundant information and retain relevant information in the combined representations of drugs and targets.

  • Extensive experimental results demonstrate that GSAML-DTA outperforms state-of-the-art methods for DTA prediction on two benchmark datasets.

Abstract

Identifying drug-target affinity (DTA) is of great practical importance in designing efficacious drugs for known diseases. Recently, numerous deep learning-based computational methods have been developed to predict drug-target affinity and have achieved impressive performance. However, most of them construct the molecule (drug or target) encoder without weighting the features of each node (atom or residue). In addition, they generally combine drug and target representations directly, so the combined representation may contain task-irrelevant information. In this study, we develop GSAML-DTA, an interpretable deep learning framework for DTA prediction. GSAML-DTA integrates a self-attention mechanism and graph neural networks (GNNs) to build representations of drugs and target proteins from structural information. In addition, mutual information is introduced to filter out redundant information and retain relevant information in the combined representations of drugs and targets. Extensive experimental results demonstrate that GSAML-DTA outperforms state-of-the-art methods for DTA prediction on two benchmark datasets. Furthermore, GSAML-DTA can be interpreted to analyze binding atoms and residues, which may support data-driven chemical biology studies. Overall, GSAML-DTA can serve as a powerful and interpretable tool for DTA modelling.

Introduction

Developing a new drug generally takes more than ten years and costs billions of dollars, and fewer than 12% of drugs are approved to enter the market [1,2]. Accurate assessment of drug-target interactions is a crucial step in the early stages of drug development and in uncovering side effects [3]. Binding affinity is the strength of a drug-target interaction, usually expressed through measures such as the inhibition constant (Ki), the dissociation constant (Kd), or the half-maximal inhibitory concentration (IC50) [4]. Although wet lab experiments remain the most reliable and effective way to identify drug-target binding affinity, they are time-consuming and resource-intensive. To mitigate this issue, numerous computational methods have been proposed to accelerate new drug development and reduce its cost [5].

The existing computational methods mainly fall into two categories: structure-based methods and structure-free methods. Structure-based methods mainly exploit three-dimensional (3D) structure information of small molecules and proteins to explore potential binding poses at the atom level and identify binding affinities. Molecular docking is a well-established structure-based method that combines sampling of potential binding poses with scoring functions to minimize the free energy of the pose within binding sites [6,7]. Although these methods have achieved relatively attractive predictive performance and provide reasonable biological interpretation, their coverage is limited due to the high computational complexity of solving such 3D structures and the scarcity of small molecules and proteins with known 3D structures.

An alternative to structure-based methods is structure-free methods, including feature-based methods and deep learning methods, which rely only on sequence information and require fewer computational resources. Feature-based methods mainly use primary sequence information to model the binding affinity. Concretely, they extract discriminative biological features of a drug-target pair and feed the extracted features into a machine learning or deep learning model, such as Naïve Bayes (NB), logistic regression (LR), a deep neural network (DNN), or a kernel-based method, to predict the binding affinity. For example, Lenselink et al. created and benchmarked a standardized dataset. Based on this dataset, they compared DNN with various traditional classifiers (e.g., NB and LR) and showed that DNN produced the best results [8]. Rifaioglu et al. integrated multiple protein features, including physicochemical properties and sequential, structural, and evolutionary features, into numerous 2D vectors. They then fed the vectors to state-of-the-art pairwise input hybrid deep neural networks to predict drug-target interactions [9–11].

Although feature-based methods generalize well and are sensitive to sequence information, they rely heavily on hand-crafted feature engineering grounded in expert knowledge. Deep learning methods, that is, end-to-end differentiable models, can potentially overcome this limitation. They automatically learn features and invariances from the given data and generalize satisfactorily despite having a large number of parameters. Inspired by their successful application in various research fields [12,13], numerous deep learning methods have been proposed for DTA prediction. For example, Öztürk et al. constructed DeepDTA, a deep learning model that employed convolutional neural networks (CNNs) to extract latent features of drugs and proteins separately and concatenated the two learned feature vectors for final prediction through fully connected layers [14]. They also proposed another DTA model, WideDTA, which integrated different text-based information sources to better represent the interaction [15]. DeepCDA [16] proposed a bidirectional attention mechanism to encode the binding strength between each pair of protein substructure and compound substructure, and combined a CNN with Long Short-Term Memory (LSTM) to learn representations of proteins and compounds.

Although CNN-based models have shown satisfactory performance in DTA prediction, they ignore structural information. They represent the input molecules only as sequences (one-dimensional structures), which may miss spatial information critical for characterizing the intrinsic properties of molecules. To solve this problem, graph neural networks (GNNs), which can extract structural features, are widely used in DTA prediction models [17–22]. For example, DeepGS [23] learned the interaction between drugs and targets through the local chemical context and topological structure, and extensive experiments on both large and small benchmark datasets demonstrated its competitiveness. GraphDTA [19] represented drugs as graphs and adopted several GNNs, such as the Graph Convolutional Network (GCN), Graph Attention Network (GAT), and Graph Isomorphism Network (GIN), to extract drug features. The results confirmed that deep learning models are beneficial for drug-target binding affinity prediction and that representing drugs as graphs improves model performance. Jiang et al. represented compounds as molecular graphs, used contact maps derived from protein sequences to build protein graphs, and then built GNNs to obtain feature representations. Their experimental results show that representing proteins through contact maps can improve prediction performance [24].
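To make the idea of a contact-map-based protein graph concrete, the following is a minimal sketch in plain Python. It assumes the contact map is a square matrix of residue-residue contact scores and that an edge is drawn whenever the score reaches a cutoff. The function name `contact_map_to_edges`, the 0.5 cutoff, and the toy values are illustrative assumptions and are not taken from the cited works.

```python
from typing import List, Tuple


def contact_map_to_edges(
    contact_scores: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return undirected residue-residue edges (i, j) with i < j.

    An edge is kept when the contact score is at least `threshold`.
    The threshold value is an assumption made for this illustration.
    """
    n = len(contact_scores)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: the map is symmetric
            if contact_scores[i][j] >= threshold:
                edges.append((i, j))
    return edges


# Toy 4-residue contact map (values invented for illustration).
toy_map = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
edges = contact_map_to_edges(toy_map)
print(edges)
```

Each residue becomes a node and each retained contact becomes an edge, so the resulting edge list can be passed to any graph library as the protein graph's connectivity.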

In summary, most existing deep learning methods fail to consider the contribution of each drug atom and protein residue to the binding affinity and ignore the information hidden in different layers, which leads to partial information loss during feature learning and degrades prediction performance. Moreover, directly concatenating the learned features of drugs and proteins without further optimization may introduce much task-irrelevant information. To overcome these limitations, we propose GSAML-DTA, an interpretable deep learning framework for predicting drug-target binding affinity. First, we construct drug graphs and protein graphs from drug SMILES (Simplified Molecular Input Line Entry System) strings and protein contact maps, respectively. Next, a hybrid network GAT-GCN with a self-attention mechanism is designed to extract layer-wise structural information from drug and protein graphs. The extracted layer-wise features of the drug and target are fused separately, and then the fused features are concatenated to obtain a combined representation of a drug-target pair. Finally, the mutual information principle is applied to the combined representation, and the output is fed into fully connected layers to predict binding affinity. Through comprehensive evaluation on two benchmark datasets, we demonstrate that GSAML-DTA outperforms state-of-the-art methods. Additionally, our model can be employed to identify the important binding atoms and residues that contribute most to DTA prediction, thus providing biological interpretability.
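The fusion steps described above can be illustrated with a small sketch. It is not the authors' GAT-GCN: it assumes that node features for each layer are already computed, uses a simple softmax attention readout to pool nodes into one vector per layer, concatenates the per-layer vectors, and then concatenates the drug and protein vectors. The mutual information step and the fully connected layers are omitted. All names (`attention_readout`, `fuse_layers`, `score_w`) and all numbers are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_readout(nodes: List[Vector], score_w: Vector) -> Tuple[Vector, List[float]]:
    """Pool node features into one vector with softmax attention weights."""
    # One scalar score per node: dot product with a scoring vector.
    scores = [sum(w * x for w, x in zip(score_w, h)) for h in nodes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(nodes[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, nodes)) for d in range(dim)]
    return pooled, alphas


def fuse_layers(layers: List[List[Vector]], score_w: Vector) -> Vector:
    """Concatenate the attention-pooled vector of every layer."""
    fused: Vector = []
    for nodes in layers:
        pooled, _ = attention_readout(nodes, score_w)
        fused.extend(pooled)
    return fused


# Two layers of toy features: 3 atoms for the drug, 2 residues for the protein.
drug_layers = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [[0.5, 0.5], [1.0, 0.0], [3.0, 1.0]],
]
protein_layers = [
    [[0.2, 0.1], [1.5, 0.5]],
    [[0.0, 1.0], [1.0, 1.0]],
]
score_w = [1.0, 1.0]  # hypothetical scoring vector (learned in a real model)

drug_vec = fuse_layers(drug_layers, score_w)
protein_vec = fuse_layers(protein_layers, score_w)
pair_vec = drug_vec + protein_vec  # combined drug-target representation

_, atom_weights = attention_readout(drug_layers[0], score_w)
print(len(pair_vec), atom_weights)
```

In a sketch like this, the per-node attention weights are what one would inspect to see which atoms or residues dominate the pooled representation, which is the kind of interpretability the paragraph describes.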

Datasets

To perform head-to-head comparisons of GSAML-DTA with existing machine learning and deep learning-based methods, we evaluate our model on two publicly available DTA datasets, the Davis dataset [25] and the KIBA dataset [26]. The Davis dataset consists of 442 proteins and 68 compounds forming 30,056 drug-target pairs, in which the binding affinity is measured by kinase dissociation constant (Kd) values. A higher Kd value represents a lower binding strength of a drug-target pair. These data are selected from the
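Because a higher Kd means weaker binding, Kd values are often moved to a negative logarithmic scale (pKd) so that larger numbers mean stronger binding. The sketch below shows that conversion. Whether GSAML-DTA applies this exact transformation is an assumption here, since the text above is cut off before any preprocessing is described, and the function name `kd_nm_to_pkd` is hypothetical.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant in nanomolar to pKd = -log10(Kd in molar)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm * 1e-9)  # 1 nM = 1e-9 M


# A weak binder (large Kd) gets a smaller pKd than a strong binder (small Kd).
weak = kd_nm_to_pkd(10000.0)
strong = kd_nm_to_pkd(1.0)
print(weak, strong)
```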

Performance evaluation metrics

To assess the performance of the proposed GSAML-DTA, we adopt three commonly used statistical metrics: Concordance Index (CI) [36], Mean Squared Error (MSE), and $r_m^2$ [37]. CI is mainly employed to assess the difference between the predicted value and the actual value as follows:

$$\mathrm{CI} = \frac{1}{Z} \sum_{d_x > d_y} h(b_x - b_y), \qquad h(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0.5, & \text{if } x = 0 \\ 0, & \text{if } x < 0 \end{cases}$$

where $b_x$ is the predicted value of the larger affinity $d_x$, $b_y$ is the predicted value of the smaller affinity $d_y$, $Z$ is the normalization constant, and $h(x)$ is the step
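The CI definition above, together with MSE, can be sketched in plain Python as follows. This is a straightforward quadratic-time illustration rather than an optimized implementation, it takes Z to be the number of pairs with different true affinities, and the function names and toy values are assumptions made for the example. The $r_m^2$ metric is left out because its formula does not appear in this excerpt.

```python
from typing import Sequence


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """CI = (1/Z) * sum over pairs with d_x > d_y of h(b_x - b_y)."""
    total = 0.0
    z = 0  # number of comparable pairs (true affinities differ)
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                z += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    total += 1.0  # pair ranked correctly
                elif diff == 0:
                    total += 0.5  # tie in the predictions
    if z == 0:
        raise ValueError("no comparable pairs")
    return total / z


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between true and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy affinities (invented): one of the six comparable pairs is mis-ranked.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.1, 6.4, 6.2, 8.3]
ci = concordance_index(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(ci, mse)
```

A CI of 1.0 means every comparable pair is ranked in the correct order, while 0.5 corresponds to random ranking. MSE, in contrast, penalizes the size of each error regardless of ranking.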

Conclusion

In this study, we propose a novel deep learning model, GSAML-DTA, to predict the binding affinities of drug-target pairs, which is a crucial step for rapid virtual drug screening and drug development. We first generate graphs of the drug and target, and then employ a self-attention mechanism and a hybrid graph neural network GAT-GCN to extract their structural information. Subsequently, to learn an informative representation of the drug-target pair, mutual information is applied to the combined

Funding

This study was supported by the Natural Science Foundation of China (No. 62071278).

Declaration of competing interest

There is no competing financial interest to declare.

References (46)

  • P.T. Lang et al.

    Dock 6: Combining techniques to model RNA–small molecule complexes

    (2009)
  • G.M. Morris et al.

    AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility

    Journal of Computational Chemistry

    (2009)
  • E.B. Lenselink et al.

    Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set

    Journal of Cheminformatics

    (2017)
  • A.S. Rifaioglu et al.

    MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery

    (2021)
  • J. Hong et al.

    Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning

    Briefings Bioinf.

    (2020)
  • J. Hong et al.

    Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery

    Briefings Bioinf.

    (2020)
  • H. Öztürk et al.

    DeepDTA: deep drug–target binding affinity prediction

    (2018)
  • H. Öztürk et al.

    WideDTA: prediction of drug-target binding affinity

    arXiv

    (2019 Feb 4)
  • K. Abbasi et al.

    DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks

    (2020)
  • M. Karimi et al.

    DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks

    (2019)
  • M. Karimi et al.

    Explainable Deep Relational Networks for Predicting Compound-Protein Affinities and Contacts

    Journal of Chemical Information and Modeling

    (2021)
  • T. Nguyen et al.

    Predicting drug–target binding affinity with graph neural networks

    (2021)
  • W. Torng et al.

    Graph Convolutional Neural Networks for Predicting Drug-Target Interactions

    Journal of Chemical Information and Modeling

    (2019)