Predicting drug–target interaction using positive-unlabeled learning
Introduction
Developing a new drug is costly and time-consuming. According to statistics from the US Food and Drug Administration (FDA), discovering a new molecular entity costs approximately $1.8 billion and takes 13 years on average [1]. In addition, only about 20 new molecular entities are approved by the FDA each year. Reducing the expense of drug discovery is therefore an important problem, and computational methods provide an effective strategy for addressing it [2].
With the development of high-throughput techniques, a great deal of drug–target interaction data has been generated [3], [4], [5]. Several databases have been established to store interaction information and provide retrieval services. For example, DrugBank [6] is a popular web resource on drugs and drug targets that contains 7740 drug entries in its present version. ChEMBL [7], maintained by the European Bioinformatics Institute (EBI), is a manually curated chemical database of bioactive molecules with drug-like properties; version 19 contains 10,579 targets, 1,637,862 compound records and 2,843,338 items of bioactivity evidence. SuperTarget [8] is a freely accessible online database that contains over 6000 target proteins.
The availability of interaction data has boosted computational methods for predicting drug–target interactions. Traditional computational methods for drug–target interaction identification can be classified into three categories: ligand-based methods [9], [10], docking-based methods [11], [12] and literature text mining methods [13]. These approaches have achieved great success in drug–target interaction prediction. However, each has limitations: ligand-based methods depend on the number of known ligands, docking-based methods require protein structure information, and literature text mining methods cannot discover previously unknown interactions.
Recently, a growing number of statistical methods have been proposed to predict drug–target interactions by integrating biological knowledge such as drug chemical structures, target protein sequences, gene expression and known drug–target interactions [14], [15], [16], [17]. These approaches assume that similar drugs show similar patterns of interaction with targets in the drug–target interaction network [18], [19]. Chen et al. [15] presented a network-based random walk with restart method, called NRWRH, which predicts relationships between drugs and targets by integrating the drug–drug chemical structure similarity network, the protein–protein sequence similarity network and the known drug–target interaction network into a heterogeneous network. Cheng et al. [14] proposed three inference methods, drug-based similarity inference (DBSI), target-based similarity inference (TBSI) and network-based inference (NBI), to predict drug–target interactions. In similar work, Alaimo et al. [20] presented the DT-hybrid approach, which extends network-based inference with domain-based knowledge to detect drug–target interactions. Emig et al. [21] integrated different network-based methods to predict drug targets for a specific disease. These methods are easy to implement, but they cannot be applied to drugs without any known target information. In addition, Bleakley and Yamanishi [17] employed bipartite local models to predict relationships between drugs and targets. Mei et al. [22] extended this work by integrating neighbour information into bipartite local models for drug–target interaction identification. The Gaussian interaction profile kernel and weighted nearest neighbour were combined for drug–target interaction prediction [23]. Bayesian matrix factorization with binary classification [24] and probabilistic matrix factorization [25] have also been proposed to detect drug–target interactions.
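The assumption that similar drugs share similar targets can be made concrete with a small sketch. The code below scores each target for one drug by a similarity-weighted average over the known interactions of the other drugs, in the spirit of drug-based similarity inference. It is an illustrative simplification and not the exact formulation of any cited method; the function name `similarity_weighted_scores` and the toy matrices are hypothetical.

```python
def similarity_weighted_scores(drug_sim, interactions, drug_idx):
    """Score every target for one drug from the targets of similar drugs.

    drug_sim:     square matrix of drug-drug similarities (toy values here)
    interactions: binary drug x target matrix of known interactions
    """
    n_drugs = len(interactions)
    n_targets = len(interactions[0])
    # Total similarity to all *other* drugs, used for normalisation.
    total = sum(drug_sim[drug_idx][l] for l in range(n_drugs) if l != drug_idx)
    scores = []
    for j in range(n_targets):
        weighted = sum(
            drug_sim[drug_idx][l] * interactions[l][j]
            for l in range(n_drugs)
            if l != drug_idx
        )
        scores.append(weighted / total if total > 0 else 0.0)
    return scores


# Toy data (made up for illustration): 3 drugs x 3 targets.
drug_sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
interactions = [
    [0, 0, 1],  # drug 0: one known target
    [1, 0, 1],  # drug 1: very similar to drug 0
    [0, 1, 0],  # drug 2: dissimilar to drug 0
]
scores = similarity_weighted_scores(drug_sim, interactions, drug_idx=0)
print(scores)
```

In this toy case target 0 receives a higher score than target 1 for drug 0, because it is bound by the drug most similar to drug 0.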
A common limitation of these supervised learning approaches is that they treat unknown drug–target interactions as negative samples, which may reduce predictive accuracy. Xia et al. [16] developed a semi-supervised method (NetLapRLS) for drug–target interaction identification that uses positive and unlabeled samples. Chen and Zhang [26] presented the NetCBP method, which identifies associations between drugs and targets by maximizing the rank coherence with respect to known knowledge. These semi-supervised methods can make use of unlabeled information, but they need to combine two different classifiers in the final step.
Although these approaches have achieved good performance, drug–target interaction prediction still faces several limitations and difficulties. First, most methods use sequence information to measure the similarity of two proteins. However, studies have demonstrated that structure is more conserved than sequence, so the structure information of target proteins may be better suited to drug–target interaction identification. Second, there are no experimentally verified negative samples. Traditional methods treat non-interaction data as negative samples, which is unreasonable because those data may contain undetected drug–target interactions. Third, some methods cannot make predictions for new drugs without any known targets, which limits their practical application.
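The second point can be illustrated with a minimal sketch of how a known-interaction matrix is read in a positive-unlabeled setting: entries equal to 1 form the positive set, while entries equal to 0 are only unlabeled, not confirmed negatives. The function name `split_positive_unlabeled` and the toy matrix are hypothetical and serve only as an illustration.

```python
def split_positive_unlabeled(interactions):
    """Split a binary drug x target matrix into positive and unlabeled pairs.

    A 0 entry means "no interaction reported so far", so it goes into the
    unlabeled set rather than a negative set.
    """
    positives, unlabeled = [], []
    for i, row in enumerate(interactions):
        for j, known in enumerate(row):
            if known:
                positives.append((i, j))
            else:
                unlabeled.append((i, j))
    return positives, unlabeled


# Toy matrix (made up): 2 drugs x 3 targets.
toy = [
    [1, 0, 0],
    [0, 1, 0],
]
P, U = split_positive_unlabeled(toy)
print("positive pairs:", P)
print("unlabeled pairs:", U)
```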
In this paper, we propose a framework for predicting drug–target interactions based on positive-unlabeled learning. Compared with existing approaches, we integrate multiple target resources, including target structure information, target function category information and target function annotation information. In addition, we treat unknown drug–target interactions as an unlabeled set U instead of a negative set N. Three strategies (random walk with restart, KNN and heat kernel diffusion) are used to classify unlabeled samples into two groups, reliable negative samples (RN) and likely negative samples (LN), based on target similarity information, and majority voting is used to aggregate these strategies and decide the final label of each unlabeled sample. Weighted support vector machines are then employed to build a multi-level classifier that predicts drug–target interactions from the positive set, the reliable negative set and the likely negative set. Experiments are conducted on four datasets (Enzyme, Ion Channel, GPCR and Nuclear Receptor). The experimental results demonstrate that our method outperforms state-of-the-art approaches.
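The labeling step described above can be sketched as follows. The example runs a plain random walk with restart on a toy target similarity matrix, turns the scores into RN/LN votes with a threshold, and aggregates three strategies by majority voting. Several parts are assumptions made for illustration only: the restart probability, the threshold rule (unlabeled targets close to a known target are called LN, distant ones RN), the toy similarity values, and the KNN and heat kernel votes, which are simply given as hypothetical inputs rather than computed.

```python
from collections import Counter


def random_walk_with_restart(sim, seeds, restart=0.7, n_iter=100):
    """Iterate p <- (1 - restart) * W p + restart * p0.

    W is `sim` with the diagonal removed and every column scaled to sum to 1;
    p0 puts equal mass on the seed nodes (the known targets of one drug).
    """
    n = len(sim)
    col_sums = [sum(sim[i][j] for i in range(n) if i != j) for j in range(n)]
    W = [
        [sim[i][j] / col_sums[j] if i != j and col_sums[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(n_iter):
        p = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
    return p


def label_by_threshold(scores, unlabeled, threshold):
    """Assumed rule: a high score means 'likely negative', a low one 'reliable negative'."""
    return ["LN" if scores[t] >= threshold else "RN" for t in unlabeled]


def majority_vote(*votes_per_strategy):
    """Return the most common label per sample across the strategies."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*votes_per_strategy)]


# Toy target-target similarity (made up); target 0 is the drug's known target.
target_sim = [
    [1.00, 0.90, 0.10, 0.05],
    [0.90, 1.00, 0.20, 0.10],
    [0.10, 0.20, 1.00, 0.30],
    [0.05, 0.10, 0.30, 1.00],
]
unlabeled_targets = [1, 2, 3]

rwr_scores = random_walk_with_restart(target_sim, seeds={0})
rwr_votes = label_by_threshold(rwr_scores, unlabeled_targets, threshold=0.1)
knn_votes = ["LN", "RN", "LN"]   # hypothetical output of a KNN strategy
heat_votes = ["LN", "RN", "RN"]  # hypothetical output of heat kernel diffusion

final_labels = majority_vote(rwr_votes, knn_votes, heat_votes)
print(dict(zip(unlabeled_targets, final_labels)))
```

With three strategies and two labels, a tie cannot occur, so the vote is always decided. The resulting P, LN and RN sets could then be given different misclassification weights when training the classifier; in scikit-learn, for instance, per-sample weights can be passed through the `sample_weight` argument of `SVC.fit`, although whether that matches the weighting used in this work is not stated in the text above.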
Section snippets
Data preparation
In this paper, we use four human drug–target interaction networks, covering Enzyme, Ion Channel, GPCR and Nuclear Receptor targets, which were first analysed by Yamanishi et al. [27]. These datasets can be downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Table 1 summarizes the four datasets. The drug–target interaction data were collected from KEGG BRITE [28], BRENDA [29], SuperTarget [8] and DrugBank [6].
Drug chemical structure information is retrieved from the DRUG
Experiments and results
In this section, we first analyse the degree distributions of drugs in the four drug–target interaction networks. Then, we compare our method with five state-of-the-art approaches (DBSI [14], NetLapRLS [16], KBMF2K [24], NetCBP [26] and WNN-GIP [23]) for drug–target interaction prediction. Finally, we show the performance of our method in identifying potential drug–target interactions.
Conclusion and discussion
Systematically understanding the associations between chemical compounds and target proteins is conducive to new drug design and discovery. Owing to the limitations of traditional experimental methods, it is common for biological scientists to predict drug–target interactions by computational methods, and many computational approaches have been developed for this purpose. However, these methods have some limitations: (1) some methods treat unlabeled
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grant nos. 61232001, 61428209 and 61420106009; the Program for New Century Excellent Talents in University (NCET-12-0547).
References (46)
- et al. Drug target identification through systems biology. Drug Discov. Today: Technol. (2015)
- et al. Identification of common molecular subsequences. J. Mol. Biol. (1981)
- et al. Synthetic lethal and biochemical analyses of NAD and NADH kinases in Saccharomyces cerevisiae establish separation of cellular functions. J. Biol. Chem. (2006)
- et al. Treatment of hyperphosphatemia in patients with chronic kidney disease on maintenance hemodialysis. Kidney Int. (2005)
- Drug discovery: predicting promiscuity. Nature (2009)
- et al. Network output controllability-based method for drug target identification. IEEE Trans. NanoBiosci. (2015)
- et al. Phenotypic screening in cancer drug discovery – past, present and future. Nat. Rev. Drug Discov. (2014)
- et al. Exploiting structural information for drug-target assessment. Future Med. Chem. (2014)
- et al. A fast and high performance multiple data integration algorithm for identifying human disease genes. BMC Med. Genom. (2015)
- et al. DrugBank 4.0: shedding new light on drug metabolism. Nucl. Acids Res. (2014)
- ChEMBL: a large-scale bioactivity database for drug discovery. Nucl. Acids Res.
- SuperTarget goes quantitative: update on drug–target interactions. Nucl. Acids Res.
- Relating protein pharmacology by ligand chemistry. Nat. Biotechnol.
- Insights into an original pocket–ligand pair classification: a promising tool for ligand profile prediction. PLoS One
- Structure-based maximal affinity model predicts small-molecule druggability. Nat. Biotechnol.
- Small-molecule ligand docking into comparative models with Rosetta. Nat. Protoc.
- A probabilistic model for mining implicit ‘chemical compound–gene’ relations from literature. Bioinformatics
- Prediction of drug–target interactions and drug repositioning via network-based inference. PLoS Comput. Biol.
- Drug–target interaction prediction by random walk on the heterogeneous network. Mol. BioSyst.
- Semi-supervised drug–protein interaction prediction from heterogeneous biological spaces. BMC Syst. Biol.
- Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics
- Drug–target interaction prediction: databases, web servers and computational models. Brief. Bioinform.
- Drug–target interaction prediction through domain-tuned network-based inference. Bioinformatics
Wei Lan received his B.Sc. and M.Sc. degrees from Henan Polytechnical University and Guangxi University, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate in Bioinformatics at Central South University. His current research interests include data mining, machine learning and bioinformatics, especially drug targets, disease genes and noncoding RNA.
Jianxin Wang received the B.Eng. and M.Eng. degrees in Computer Engineering from Central South University, China, in 1992 and 1996, respectively, and the Ph.D. degree in Computer Science from Central South University, China, in 2001. He is the Vice Dean and a Professor in the School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. His current research interests include algorithm analysis and optimization, parameterized algorithms, bioinformatics and computer networks. He has published more than 150 papers in various international journals and refereed conferences.
Min Li received the B.S. degree in Communication Engineering from Central South University, China, in 2001, the M.S. degree in Traffic Information and Control Engineering from Central South University, China, in 2004 and the Ph.D. degree in Computer Science from Central South University, China, in 2008. She is a Professor in the School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. Her current research interests include protein–protein interaction networks, essential protein discovery, integrative analysis of molecular networks with other biological data and identification of dynamic network modules.
Jin Liu received his B.S. degree in Automation from East China Institute of Technology in 2010 and his M.S. degree in Computer Technology from University of Chinese Academy of Sciences in 2013. He is currently a Ph.D. Candidate in School of Information Science and Engineering, Central South University, Changsha, Hunan, PR China. His current research interests include medical image analysis, machine learning and pattern recognition.
Yaohang Li is an Associate Professor in the Department of Computer Science at Old Dominion University, Norfolk, VA, USA. His research interests are in Computational Biology and Scientific Computing. He received the M.S. and Ph.D. degrees in Computer Science from the Florida State University, Tallahassee, FL, USA, in 2000 and 2003, respectively. After graduation, he worked at Oak Ridge National Laboratory as a research associate for a short period of time. Before joining ODU, he was an Associate Professor in the Computer Science Department at North Carolina A&T State University, Greensboro, NC, USA.
Fang-Xiang Wu received the B.Sc. and M.Sc. degrees in Applied Mathematics, both from Dalian University of Technology, China, in 1990 and 1993, respectively, his first Ph.D. degree in Control Theory and its Applications from Northwestern Polytechnical University in 1998, and his second Ph.D. degree in Biomedical Engineering from the University of Saskatchewan, Canada, in 2004. He is currently an Associate Professor of Bioengineering with the Department of Mechanical Engineering and graduate chair of the Division of Biomedical Engineering at the University of Saskatchewan, Canada. His current research interests include systems biology, genomic and proteomic data analysis, biological system identification and parameter estimation, and applications of control theory to biological systems.
Yi Pan is a Regents' Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University, USA. Dr. Pan joined Georgia State University in 2000 and was promoted to full professor in 2004, named a Distinguished University Professor in 2013 and designated a Regents' Professor (the highest recognition given to a faculty member by the University System of Georgia) in 2015. He served as the Chair of the Computer Science Department from 2005 to 2013. He is also a visiting Changjiang Chair Professor at Central South University, China. Dr. Pan received his B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, USA, in 1991. His profile has been featured as a distinguished alumnus in both the Tsinghua Alumni Newsletter and the University of Pittsburgh CS Alumni Newsletter. Dr. Pan's research interests include parallel and cloud computing, wireless networks, and bioinformatics. Dr. Pan has published more than 330 papers including over 180 SCI journal papers and 60 IEEE/ACM Transactions papers. In addition, he has edited/authored 40 books. His work has been cited more than 6500 times. Dr. Pan has served as an editor-in-chief or editorial board member for 15 journals including 7 IEEE Transactions. He is the recipient of many awards including the IEEE Transactions Best Paper Award, 4 other international conference or journal Best Paper Awards, 4 IBM Faculty Awards, 2 JSPS Senior Invitation Fellowships, the IEEE BIBE Outstanding Achievement Award, the NSF Research Opportunity Award, and the AFOSR Summer Faculty Research Fellowship. He has organized many international conferences and delivered keynote speeches at over 50 international conferences around the world.