Abstract
Current biological knowledge is buried in hundreds of proprietary and public life science databases available on the World Wide Web (WWW) and in millions of scientific publications. Gaining access to this knowledge can prove difficult because each database may provide different tools to query or display its data, may differ in structure and user interface, or may interpret biological knowledge differently from the others. Systems approaches to biological research require that existing biological knowledge (data) be made available to support both the analysis of experimental results and the construction and enrichment of models. Data integration methods are being developed to address these issues by providing a consolidated view of molecular information fused together from multiple databases. However, a key challenge for data integration is identifying links between closely related entries in different life science databases when no direct information provides a reliable cross-reference. Here we describe and evaluate three data integration methods that address this challenge in the context of a graph-based data integration framework (the Ondex system). We give a quantitative evaluation of their performance in two different situations: the integration and analysis of different metabolic pathway resources, and the mapping of equivalent elements between the Gene Ontology and a nomenclature describing enzyme function.
Acknowledgements
We would like to thank all current and previous contributors to the Ondex system (see www.ondex.org). The main part of this work was carried out at Rothamsted Research. Rothamsted Research receives grant-in-aid from the Biotechnology and Biological Sciences Research Council (BBSRC). This work was supported by BBSRC SABR award BB/F006039/1 and TSB project TP 5082–33372. JT would also like to thank EMBL-EBI for allowing time to write this chapter.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Taubert, J., Köhler, J. (2014). Molecular Information Fusion in Ondex. In: Chen, M., Hofestädt, R. (eds) Approaches in Integrative Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41281-3_5
DOI: https://doi.org/10.1007/978-3-642-41281-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41280-6
Online ISBN: 978-3-642-41281-3
eBook Packages: Computer Science (R0)