GSAML-DTA: An interpretable drug-target binding affinity prediction model based on graph neural networks with self-attention mechanism and mutual information
Introduction
Developing a new drug generally takes more than ten years and costs billions of dollars, and fewer than 12% of candidate drugs are approved to enter the market [1,2]. Accurate assessment of drug-target interactions is a crucial step in the early stage of drug development and in uncovering side effects [3]. Binding affinity is the strength of a drug-target interaction, and it is usually expressed with metrics such as the inhibition constant ($K_i$), the dissociation constant ($K_d$), or the half-maximal inhibitory concentration ($IC_{50}$) [4]. Although wet-lab experiments remain the most reliable and effective way to measure drug-target binding affinity, they are time-consuming and resource-intensive. To mitigate this issue, numerous computational methods have been proposed to accelerate new drug development and reduce its cost [5].
The existing computational methods mainly fall into two categories: structure-based methods and structure-free methods. Structure-based methods mainly exploit three-dimensional (3D) structure information of small molecules and proteins to explore potential binding poses at the atom level and identify binding affinities. Molecular docking is a well-established structure-based method that combines a search over potential binding poses with scoring functions to minimize the free energy of the pose within the binding site [6,7]. Although these methods have achieved relatively attractive predictive performance and provide reasonable biological interpretation, their coverage is limited due to the high computational complexity of solving such 3D structures and the scarcity of small molecules and proteins with known 3D structures.
An alternative to structure-based methods is structure-free methods, including feature-based methods and deep learning methods, which only rely on sequence information and require fewer computational resources. Feature-based methods mainly explore primary sequence information to model the binding affinity. Concretely, they focus on extracting discriminative biological features of a drug-target pair and sending extracted features into a machine/deep learning model, such as Naïve Bayes (NB), logistic regression (LR), deep neural network (DNN), and other kernel-based methods, for predicting the binding affinity. For example, Lenselink et al. created and benchmarked a standardized dataset. Based on this dataset, they compared DNN with various traditional classifiers (e.g., NB and LR). It was shown that DNN produced the best results [8]. Rifaioglu et al. integrated multiple protein features, including physicochemical properties and sequential, structural, and evolutionary features, into numerous 2D vectors. They then fed the vectors to state-of-the-art pairwise input hybrid deep neural networks to predict the drug-target interactions [9,10,11].
Although feature-based methods generalize well and are sensitive to sequence information, they are limited by their reliance on hand-crafted feature engineering driven by expert knowledge. Deep learning methods, that is, end-to-end differentiable models, can potentially address these limitations. Indeed, they can automatically learn features and invariances of given data and provide satisfactory generalization despite a large number of parameters. Inspired by their successful application in various research fields [12,13], numerous deep learning methods have been proposed for drug-target affinity (DTA) prediction. For example, Öztürk et al. constructed a deep learning model, DeepDTA, that employed convolutional neural networks (CNNs) to extract high-level latent features of drugs and proteins separately and concatenated the two learned features for final prediction through fully connected layers [14]. Moreover, they proposed another DTA model, WideDTA, which integrated different text-based information to better represent the interaction [15]. DeepCDA [16] proposed a bidirectional attention mechanism to encode the binding strength between each pair of protein and compound substructures, and then combined a CNN with a Long Short-Term Memory (LSTM) network to learn representations of proteins and compounds.
Although CNN-based models have shown satisfactory performance in DTA prediction, these models ignore structural information. They only use sequences (1-dimensional structure) to represent the input molecules, which may miss the critical spatial information that characterizes the intrinsic properties of molecules. To solve this problem, graph neural networks (GNNs), which can extract structural features, are widely used in various DTA prediction models [17,18,19,20,21,22]. For example, DeepGS [23] proposed a method to learn the interaction between drugs and targets through the local chemical context and topological structure, and extensive experiments on both large and small benchmark datasets demonstrated its competitiveness. GraphDTA [19] represented drugs as graphs and adopted several GNNs, such as Graph Convolutional Network (GCN), Graph Attention Network (GAT), and Graph Isomorphism Network (GIN), to extract drug features. The results confirm that deep learning models are beneficial for drug-target binding affinity prediction and that representing drugs as graphs improves model performance. Jiang et al. represented compounds as molecular graphs, used contact maps derived from protein sequences to build protein graphs, and then built GNNs to obtain feature representations. The experimental results show that representing proteins through contact maps can improve the prediction performance of the model [24].
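To make concrete what "extracting structural features" from a molecular graph means, the following is a minimal sketch of a single graph-convolution step in the widely used Kipf and Welling form, ReLU(D^-1/2 (A + I) D^-1/2 H W). It is an illustration only, not the implementation of GraphDTA or any other cited model; the function name `gcn_layer`, the toy three-atom chain, and the two-dimensional atom features are assumptions chosen for this example.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python lists."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    in_dim, out_dim = len(feats[0]), len(weight[0])
    # Aggregate neighbor features (norm @ feats).
    agg = [[sum(norm[i][k] * feats[k][c] for k in range(n))
            for c in range(in_dim)] for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]

# Hypothetical toy graph: a chain of three heavy atoms (C-C-O).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
# Hypothetical atom features: [is_carbon, is_oxygen].
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Identity weights keep the example easy to follow.
weight = [[1.0, 0.0], [0.0, 1.0]]

out = gcn_layer(adj, feats, weight)
```

After one step, the first carbon has no "oxygen" signal because the oxygen is two bonds away, while the middle carbon already mixes in information from both neighbors. Stacking layers widens this neighborhood, which is the sense in which graph models capture topology that a flat sequence encoding does not.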
In summary, most existing deep learning methods fail to consider the contribution of each drug atom and protein residue to the binding affinity and ignore the information hidden in different layers, which leads to partial information loss during feature learning and degrades prediction performance. Moreover, directly concatenating the learned features of drugs and proteins without further optimization may introduce much task-irrelevant information. To overcome these limitations, we propose GSAML-DTA, an interpretable deep learning framework for predicting drug-target binding affinity. First, we construct drug graphs and protein graphs from drug SMILES (Simplified Molecular Input Line Entry System) strings and protein contact maps, respectively. Next, a hybrid GAT-GCN network with a self-attention mechanism is designed to extract layer-wise structural information from the drug and protein graphs. The extracted layer-wise features of the drug and of the target are fused separately, and the fused features are then concatenated to obtain a combined representation of a drug-target pair. Finally, the mutual information principle is applied to the combined representation, and the output is fed into fully connected layers to predict binding affinity. Through comprehensive evaluation on two benchmark datasets, we demonstrate that GSAML-DTA outperforms state-of-the-art methods. Additionally, our model can be employed to identify the important binding atoms and residues that contribute most to DTA prediction, thus providing biological interpretability.
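The first step of this pipeline, turning a protein contact map into a graph, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `contact_map_to_edges`, the use of a contact-probability matrix as input, and the 0.5 threshold are all hypothetical choices made for the example.

```python
def contact_map_to_edges(contact_probs, threshold=0.5):
    """Convert a residue-residue contact matrix into a directed edge list.

    contact_probs[i][j] is an assumed contact probability between residues
    i and j; pairs at or above `threshold` (an assumed cutoff) become edges.
    Self-contacts on the diagonal are skipped.
    """
    n = len(contact_probs)
    edges = []
    for i in range(n):
        for j in range(n):
            if i != j and contact_probs[i][j] >= threshold:
                edges.append((i, j))
    return edges

# Hypothetical 3-residue protein fragment.
contact_probs = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]
edges = contact_map_to_edges(contact_probs)
```

Each residue becomes a node and each thresholded contact becomes an edge, so the resulting edge list can be passed to a graph neural network in the same way as the bond list of a drug molecule.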
Section snippets
Datasets
To perform head-to-head comparisons of GSAML-DTA with existing machine/deep learning-based methods, we evaluate our model on two publicly available DTA datasets, the Davis dataset [25] and the KIBA dataset [26]. The Davis dataset consists of 442 proteins and 68 compounds forming 30,056 drug-target pairs, in which the binding affinity is measured by kinase dissociation constant ($K_d$) values. A higher $K_d$ value represents lower binding strength of a drug-target pair. These data are selected from the
Performance evaluation metrics
To assess the performance of the proposed GSAML-DTA, we adopt three commonly used statistical metrics: Concordance Index ($CI$) [36], Mean Squared Error ($MSE$), and $r_m^2$ [37]. $CI$ is mainly employed to assess the difference between the predicted value and the actual value as follows:

$$CI = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j)$$

where $b_i$ is the predicted value of the larger affinity $\delta_i$, $b_j$ is the predicted value of the smaller affinity $\delta_j$, $Z$ is the normalization constant, and $h(x)$ is the step
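As an illustration of two of these metrics, the sketch below computes the concordance index and the mean squared error in plain Python. It assumes the definition of the step function commonly used in the DTA literature (1 for a positive argument, 0.5 for zero, 0 for a negative argument) and takes the normalization constant to be the number of pairs with different true affinities; the function names and the toy affinity values are hypothetical.

```python
def concordance_index(y_true, y_pred):
    """Fraction of affinity pairs whose predicted order matches the true order.

    Assumes h(x) = 1 if x > 0, 0.5 if x == 0, 0 otherwise, and Z = number
    of ordered pairs with y_true[i] > y_true[j].
    """
    total, z = 0.0, 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                z += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    total += 1.0
                elif diff == 0:
                    total += 0.5
    return total / z if z else float("nan")

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and true values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical affinities and predictions for four drug-target pairs.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.1, 6.2, 6.0, 8.3]

ci = concordance_index(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

In this toy case five of the six comparable pairs are ranked correctly, so the concordance index is 5/6, while the single badly predicted value dominates the mean squared error. The two metrics are complementary: one measures ranking quality, the other absolute error.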
Conclusion
In this study, we propose a novel deep learning model, GSAML-DTA, to predict binding affinities of drug-target pairs, which is a crucial step for rapid virtual drug screening and drug development. We first generate graphs of the drug and target, and then employ a self-attention mechanism and a hybrid graph neural network GAT-GCN to extract their structural information. Subsequently, to learn an informative representation of the drug-target pair, mutual information is applied to the combined
Funding
This study was supported by the Natural Science Foundation of China (No. 62071278).
Declaration of competing interest
There is no competing financial interest to declare.
References (46)
- et al.
MMEASE: online meta-analysis of metabolomic data by enhanced metabolite annotation, marker selection and enrichment analysis
J. Proteomics
(2021)
- et al.
PFmulDL: a novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods
Comput. Biol. Med.
(2022)
- et al.
PubChem: Integrated Platform of Small Molecules and Biological Activities
Annual Reports in Computational Chemistry
(2008)
- et al.
An expert system rulebase for identifying contact allergens
(1994)
- et al.
Stress effects on FosB- and interleukin-8 (IL8)-driven ovarian cancer growth and metastasis
J. Biol. Chem.
(2010)
- et al.
Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019
Journal of Natural Products
(2020)
- et al.
The current status of drug discovery and development as originated in United States academia: the influence of industrial and academic collaboration on drug discovery and development
(2018)
- et al.
Identifying drug-target interactions based on graph convolutional network and deep neural network
Briefings in Bioinformatics
(2021)
- et al.
Making Sense of Large-Scale Kinase Inhibitor Bioactivity Data Sets: A Comparative and Integrative Analysis
Journal of Chemical Information and Modeling
(2014)
- et al.
Computational identification of the binding mechanism of a triple reuptake inhibitor amitifadine for the treatment of major depressive disorder
Phys. Chem. Chem. Phys.
(2018)
Dock 6
Combining techniques to model RNA–small molecule complexes
AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility
Journal of Computational Chemistry
Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set
Journal of Cheminformatics
MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery
Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning
Briefings Bioinf.
Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery
Briefings Bioinf.
DeepDTA: deep drug–target binding affinity prediction
WideDTA: prediction of drug-target binding affinity
arXiv
DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks
DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks
Explainable Deep Relational Networks for Predicting Compound-Protein Affinities and Contacts
Journal of Chemical Information and Modeling
Predicting drug–target binding affinity with graph neural networks
Graph Convolutional Neural Networks for Predicting Drug-Target Interactions
Journal of Chemical Information and Modeling