Predicting drug–target interaction using positive-unlabeled learning

doi:10.1016/j.neucom.2016.03.080

Neurocomputing

Volume 206, 19 September 2016, Pages 50-57

https://doi.org/10.1016/j.neucom.2016.03.080

Abstract

Identifying interactions between drug compounds and target proteins is an important step in drug discovery. Determining these interactions experimentally is time-consuming and expensive, and computational methods provide an effective strategy to address this issue. The difficulties of drug–target interaction identification include the scarcity of known drug–target associations and the absence of experimentally verified negative samples. In this work, we present a method, called PUDT, to predict drug–target interactions. Instead of treating unknown interactions as negative samples, we treat them as unlabeled samples. We use three strategies (random walk with restart, KNN and heat kernel diffusion) to partition the unlabeled samples into two groups, reliable negative samples (RN) and likely negative samples (LN), based on target similarity information. Majority voting is then used to aggregate these strategies and decide the final label of each unlabeled sample. Finally, a weighted support vector machine is employed to build a classifier. Four datasets (enzyme, ion channel, GPCR and nuclear receptor) are used to evaluate the performance of our method. The results demonstrate that the performance of our method is comparable to or better than that of recent state-of-the-art approaches.

Introduction

The development of a new drug is a costly and time-consuming process. According to statistical data from the US Food and Drug Administration (FDA), discovering a new molecular entity costs approximately $1.8 billion and takes 13 years on average [1]. In addition, only about 20 new molecular entities are approved by the FDA each year. Reducing these expenses is therefore an important issue in drug discovery. Computational methods provide an effective strategy to address this issue [2].

With the development of high-throughput techniques, a great deal of drug–target interaction data has been generated [3], [4], [5]. Several databases have been established to store interaction information and provide relevant retrieval services. For example, DrugBank [6] is a popular web resource containing information on drugs and drug targets; its present version contains 7740 drug entries. ChEMBL [7], maintained by the European Bioinformatics Institute (EBI), is a manually curated chemical database of bioactive molecules with drug-like properties. Version 19 contains 10,579 targets, 1,637,862 compound records and 2,843,338 bioactivity records. SuperTarget [8] is a freely accessible online database that contains over 6000 target proteins.

The availability of interaction data has boosted the use of computational methods to predict drug–target interactions. Traditional computational methods for drug–target interaction identification can be classified into three categories: ligand-based methods [9], [10], docking-based methods [11], [12] and literature text mining methods [13]. These approaches have achieved great success in drug–target interaction prediction. However, they have some limitations: ligand-based methods depend on the number of known ligands, docking-based methods require protein structure information, and literature text mining methods cannot find previously unknown interactions of interest.

Recently, a growing number of statistical methods have been proposed to predict drug–target interactions by integrating biological knowledge such as drug chemical structures, target protein sequences, gene expression and known drug–target interactions [14], [15], [16], [17]. These approaches assume that similar drugs show similar patterns of interaction with targets in the drug–target interaction network [18], [19]. Chen et al. [15] presented a network-based random walk with restart method, called NRWRH, which predicts relationships between drugs and targets by integrating a drug–drug chemical structure similarity network, a protein–protein sequence similarity network and the known drug–target interaction network into a heterogeneous network. Cheng et al. [14] proposed three inference methods, drug-based similarity inference (DBSI), target-based similarity inference (TBSI) and network-based inference (NBI), to predict drug–target interactions. In similar work, Alaimo et al. [20] presented the DT-hybrid approach, which extends the network-based inference method with domain-based knowledge to detect drug–target interactions. Emig et al. [21] integrated different network-based methods to predict drug targets for a specific disease. These methods are easy to implement. However, they cannot be applied to drugs without any known target information. In addition, Bleakley and Yamanishi [17] employed bipartite local models to predict relationships between drugs and targets. Mei et al. [22] extended this work by integrating neighbour information into bipartite local models for drug–target interaction identification. The Gaussian interaction profile kernel and weighted nearest neighbour were integrated for drug–target interaction prediction [23]. Bayesian matrix factorization and binary classification [24] and probabilistic matrix factorization [25] were also proposed to detect drug–target interactions.
The common limitation of these supervised learning approaches is that they treat unknown drug–target interactions as negative samples, which may affect predictive accuracy. Xia et al. [16] developed a semi-supervised method (NetLapRLS) for drug–target interaction identification that uses positive and unlabeled samples. Chen and Zhang [26] presented the NetCBP method, which identifies associations between drugs and targets by maximizing the rank coherence with respect to known knowledge. These semi-supervised methods can make use of unlabeled information, but they need to combine two different classifiers in the final step.
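
As an illustration of the random walk with restart idea used by network-based methods such as NRWRH [15], the following minimal sketch runs the walk on a single small similarity network. It is not the NRWRH algorithm itself, which operates on a heterogeneous drug–target network. The toy similarity matrix, the restart probability of 0.7 and the function names are assumptions made only for this example.

```python
def column_normalize(W):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(W)
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    return [[W[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(W, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * M p + restart * p0 until convergence.

    W is a similarity matrix, seeds is the set of start nodes and
    restart is the probability of jumping back to the seeds (assumed value).
    """
    n = len(W)
    M = column_normalize(W)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(M[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if diff < tol:
            break
    return p


# Hypothetical similarity network over four targets (symmetric, zero diagonal).
W = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]
scores = random_walk_with_restart(W, seeds={0})
print(scores)
```

In this toy network, nodes that are strongly connected to the seed receive a higher steady-state probability than nodes that are reachable only through intermediate nodes, which is the property that similarity-based ranking relies on.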

Although these approaches have achieved good performance, drug–target interaction prediction still faces some limitations and difficulties. Firstly, most methods use sequence information to measure the similarity of two proteins. However, a growing number of studies demonstrate that structure information is more conserved than sequence information. Therefore, the structure information of target proteins may be better suited for drug–target interaction identification. Secondly, there are no experimentally verified negative samples. Traditional methods treat non-interaction data as negative samples, which is unreasonable because those data may contain undetected drug–target interactions. Thirdly, some methods are unable to make predictions for new drugs without any known targets, which limits their application in practice.

In this paper, we propose a framework to predict drug–target interactions based on positive-unlabeled learning. Compared with existing approaches, we integrate multiple target resources, including target structure information, target function category information and target function annotation information. In addition, we treat unknown drug–target interactions as an unlabeled set U instead of a negative set N. Three strategies (random walk with restart, KNN and heat kernel diffusion) are used to classify unlabeled samples into two groups, reliable negative samples (RN) and likely negative samples (LN), based on target similarity information, and majority voting is used to aggregate these strategies to decide the final label of each unlabeled sample. Weighted support vector machines are then employed to build a multi-level classifier that predicts drug–target interactions based on the positive set, the reliable negative set and the likely negative set. The experiments are conducted on four datasets (Enzyme, Ion Channel, GPCR and Nuclear Receptor). The experimental results demonstrate that our method outperforms state-of-the-art approaches.
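
The voting and weighting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PUDT implementation: the per-strategy votes, the two-dimensional toy features, the sample weights (1.0 for P and RN, 0.3 for LN) and all function names are hypothetical, and a simple linear SVM trained by subgradient descent on the weighted hinge loss stands in for the weighted support vector machine.

```python
from collections import Counter


def majority_vote(votes_per_strategy):
    """Combine per-strategy labels ('RN' or 'LN') into one label per sample."""
    n_samples = len(votes_per_strategy[0])
    return [Counter(votes[i] for votes in votes_per_strategy).most_common(1)[0][0]
            for i in range(n_samples)]


def train_weighted_linear_svm(X, y, weights, epochs=200, lr=0.05, lam=0.01):
    """Linear SVM via subgradient descent on a per-sample weighted hinge loss."""
    theta = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi, wi in zip(X, y, weights):
            margin = yi * (sum(t * v for t, v in zip(theta, xi)) + bias)
            # L2 regularization shrinks the weight vector at every step.
            theta = [t * (1.0 - lr * lam) for t in theta]
            if margin < 1.0:
                # A margin violation is penalized in proportion to the sample weight.
                theta = [t + lr * wi * yi * v for t, v in zip(theta, xi)]
                bias += lr * wi * yi
    return theta, bias


def predict(theta, bias, x):
    """Return +1 (interaction) or -1 (no interaction) for one feature vector."""
    return 1 if sum(t * v for t, v in zip(theta, x)) + bias >= 0 else -1


# Hypothetical votes of the three strategies for four unlabeled samples.
rwr_votes = ["RN", "LN", "RN", "LN"]
knn_votes = ["RN", "RN", "LN", "LN"]
heat_votes = ["RN", "LN", "LN", "RN"]
final_labels = majority_vote([rwr_votes, knn_votes, heat_votes])

# Assumed weights: likely negatives are trusted less than reliable negatives.
SAMPLE_WEIGHT = {"P": 1.0, "RN": 1.0, "LN": 0.3}

positives = [[2.0, 2.0], [1.5, 2.5]]                                 # toy features
unlabeled = [[-2.0, -2.0], [-1.0, -1.5], [-1.5, -0.5], [-0.5, -2.0]]  # toy features
X = positives + unlabeled
y = [1] * len(positives) + [-1] * len(unlabeled)
weights = ([SAMPLE_WEIGHT["P"]] * len(positives)
           + [SAMPLE_WEIGHT[label] for label in final_labels])

theta, bias = train_weighted_linear_svm(X, y, weights)
print(final_labels, predict(theta, bias, [2.0, 2.0]))
```

With three strategies and two possible labels a vote can never tie. Giving LN samples a smaller weight means that misclassifying them costs less than misclassifying positives or reliable negatives, which reflects the possibility that some of them are undetected interactions.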

Section snippets

Data preparation

In this paper, we use four human drug–target interaction networks, involving Enzyme, Ion Channel, GPCR and Nuclear Receptor targets, which were first analysed by Yamanishi et al. [27]. These datasets can be downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Table 1 shows some information about the four datasets. The drug–target interaction data were collected from KEGG BRITE [28], BRENDA [29], SuperTarget [8] and DrugBank [6].

Drug chemical structure information is retrieved from the DRUG

Experiments and results

In this section, we first analyse the degree distributions of drugs in the four drug–target interaction networks. Then, we compare our method with five state-of-the-art approaches (DBSI [14], NetLapRLS [16], KBMF2K [24], NetCBP [26] and WNN-GIP [23]) for drug–target interaction prediction. Finally, we show the performance of our method in identifying potential drug–target interactions.

Conclusion and discussion

Systematically understanding the associations between chemical compounds and target proteins is conducive to new drug design and discovery. Owing to the limitations of traditional experimental methods, it is common for biologists to predict drug–target interactions by computational methods. Many computational approaches have been developed to predict drug–target interactions. However, these methods have some limitations: (1) some methods treat unlabeled

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant nos. 61232001, 61428209 and 61420106009; the Program for New Century Excellent Talents in University (NCET-12-0547).

Wei Lan received his B.Sc. and M.Sc. degrees from Henan Polytechnical University and Guangxi University, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate in Bioinformatics at Central South University. His current research interests include data mining, machine learning and bioinformatics, especially drug targets, disease genes and noncoding RNA.

References (46)

J.R. Haanstra et al., Drug target identification through systems biology, Drug Discov. Today: Technol. (2015)
T.F. Smith et al., Identification of common molecular subsequences, J. Mol. Biol. (1981)
P. Bieganowski et al., Synthetic lethal and biochemical analyses of NAD and NADH kinases in Saccharomyces cerevisiae establish separation of cellular functions, J. Biol. Chem. (2006)
C.R. Nolan et al., Treatment of hyperphosphatemia in patients with chronic kidney disease on maintenance hemodialysis, Kidney Int. (2005)
A.L. Hopkins, Drug discovery: predicting promiscuity, Nature (2009)
L. Wu et al., Network output controllability-based method for drug target identification, IEEE Trans. NanoBiosci. (2015)
J.G. Moffat et al., Phenotypic screening in cancer drug discovery - past, present and future, Nat. Rev. Drug Discov. (2014)
A. Volkamer et al., Exploiting structural information for drug-target assessment, Future Med. Chem. (2014)
B. Chen et al., A fast and high performance multiple data integration algorithm for identifying human disease genes, BMC Med. Genom. (2015)
V. Law et al., DrugBank 4.0: shedding new light on drug metabolism, Nucl. Acids Res. (2014)
A. Gaulton et al., ChEMBL: a large-scale bioactivity database for drug discovery, Nucl. Acids Res. (2012)
N. Hecker et al., SuperTarget goes quantitative: update on drug–target interactions, Nucl. Acids Res. (2011)
M.J. Keiser et al., Relating protein pharmacology by ligand chemistry, Nat. Biotechnol. (2007)
S. Pérot et al., Insights into an original pocket–ligand pair classification: a promising tool for ligand profile prediction, PLoS One (2013)
A.C. Cheng et al., Structure-based maximal affinity model predicts small-molecule druggability, Nat. Biotechnol. (2007)
S.A. Combs et al., Small-molecule ligand docking into comparative models with Rosetta, Nat. Protoc. (2013)
S. Zhu et al., A probabilistic model for mining implicit ‘chemical compound–gene’ relations from literature, Bioinformatics (2005)
F. Cheng et al., Prediction of drug–target interactions and drug repositioning via network-based inference, PLoS Comput. Biol. (2012)
X. Chen et al., Drug–target interaction prediction by random walk on the heterogeneous network, Mol. BioSyst. (2012)
Z. Xia et al., Semi-supervised drug–protein interaction prediction from heterogeneous biological spaces, BMC Syst. Biol. (2010)
K. Bleakley et al., Supervised prediction of drug–target interactions using bipartite local models, Bioinformatics (2009)
X. Chen et al., Drug–target interaction prediction: databases, web servers and computational models, Brief. Bioinform. (2015)
S. Alaimo et al., Drug–target interaction prediction through domain-tuned network-based inference, Bioinformatics (2013)

Jianxin Wang received the B.Eng. and M.Eng. degrees in Computer Engineering from Central South University, China, in 1992 and 1996, respectively, and the Ph.D. degree in Computer Science from Central South University, China, in 2001. He is the Vice Dean and a Professor in the School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. His current research interests include algorithm analysis and optimization, parameterized algorithms, bioinformatics and computer networks. He has published more than 150 papers in various international journals and refereed conferences.

Min Li received the B.S. degree in Communication Engineering from Central South University, China, in 2001, the M.S. degree in Traffic Information and Control Engineering from Central South University, China, in 2004 and the Ph.D. degree in Computer Science from Central South University, China, in 2008. She is a Professor in the School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. Her current research interests include protein–protein interaction networks, essential protein discovery, integrative analysis of molecular networks with other biological data and identification of dynamic network modules.

Jin Liu received his B.S. degree in Automation from East China Institute of Technology in 2010 and his M.S. degree in Computer Technology from University of Chinese Academy of Sciences in 2013. He is currently a Ph.D. candidate in the School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. His current research interests include medical image analysis, machine learning and pattern recognition.

Yaohang Li is an Associate Professor in the Department of Computer Science at Old Dominion University, Norfolk, VA, USA. His research interests are in Computational Biology and Scientific Computing. He received the M.S. and Ph.D. degrees in Computer Science from the Florida State University, Tallahassee, FL, USA, in 2000 and 2003, respectively. After graduation, he worked at Oak Ridge National Laboratory as a research associate for a short period of time. Before joining ODU, he was an Associate Professor in the Computer Science Department at North Carolina A&T State University, Greensboro, NC, USA.

Fang-Xiang Wu received the B.Sc. and M.Sc. degrees in Applied Mathematics, both from Dalian University of Technology, China, in 1990 and 1993, respectively, the first Ph.D. degree in Control Theory and its Applications from Northwestern Polytechnical University in 1998, and the second Ph.D. degree in Biomedical Engineering from the University of Saskatchewan, Canada, in 2004. Currently, he is an Associate Professor of Bioengineering with the Department of Mechanical Engineering and graduate chair of the Division of Biomedical Engineering at the University of Saskatchewan, Canada. His current research interests include systems biology, genomic and proteomic data analysis, biological system identification and parameter estimation, and applications of control theory to biological systems.

Yi Pan is a Regents' Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University, USA. Dr. Pan joined Georgia State University in 2000 and was promoted to full professor in 2004, named a Distinguished University Professor in 2013 and designated a Regents' Professor (the highest recognition given to a faculty member by the University System of Georgia) in 2015. He served as the Chair of the Computer Science Department from 2005 to 2013. He is also a visiting Changjiang Chair Professor at Central South University, China. Dr. Pan received his B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, USA, in 1991. His profile has been featured as a distinguished alumnus in both the Tsinghua Alumni Newsletter and the University of Pittsburgh CS Alumni Newsletter. Dr. Pan's research interests include parallel and cloud computing, wireless networks, and bioinformatics. Dr. Pan has published more than 330 papers, including over 180 SCI journal papers and 60 IEEE/ACM Transactions papers. In addition, he has edited/authored 40 books. His work has been cited more than 6500 times. Dr. Pan has served as an editor-in-chief or editorial board member for 15 journals, including 7 IEEE Transactions. He is the recipient of many awards, including an IEEE Transactions Best Paper Award, 4 other international conference or journal Best Paper Awards, 4 IBM Faculty Awards, 2 JSPS Senior Invitation Fellowships, the IEEE BIBE Outstanding Achievement Award, an NSF Research Opportunity Award, and an AFOSR Summer Faculty Research Fellowship. He has organized many international conferences and delivered keynote speeches at over 50 international conferences around the world.
