ABSTRACT
Knowledge graphs have become a popular means of modeling complex biological systems, capturing the interactions between biological entities and their effects on the system as a whole. They also support relational learning models, which are known to provide highly scalable and accurate predictions of associations between biological entities. Despite the success of combining biological knowledge graphs with relational learning models in biological predictive tasks, unified biological knowledge graph resources are lacking. As a result, current efforts to apply relational learning models to biological data have each had to compile and build their own biological knowledge graphs from open biological databases. This process is often performed inconsistently across such efforts, especially in the choice of original resources, the alignment of identifiers across databases, and the assessment of the quality of the included data. To make relational learning on biomedical data more standardised and reproducible, we propose a new biological knowledge graph that compiles curated relational data from open biological databases in a unified format with common, interlinked identifiers. We also provide a new module for mapping identifiers and labels between databases, which can be used to align our knowledge graph with biological data from other heterogeneous sources. Finally, to illustrate the practical relevance of our work, we provide a set of benchmarks based on the presented data that can be used to train and assess relational learning models on various tasks related to pathway and drug discovery.
Index Terms
- BioKG: A Knowledge Graph for Relational Learning On Biological Data