Elsevier

Journal of Web Semantics

Volume 14, July 2012, Pages 2-13
Ontology paper
Emerging practices for mapping and linking life sciences data using RDF — A case series

https://doi.org/10.1016/j.websem.2012.02.003

Abstract

Members of the W3C Health Care and Life Sciences Interest Group (HCLS IG) have published a variety of genomic and drug-related data sets as Resource Description Framework (RDF) triples. This experience has helped the interest group define a general data workflow for mapping health care and life science (HCLS) data to RDF and linking it with other Linked Data sources. This paper presents the workflow along with four case studies that demonstrate the workflow and address many of the challenges that may be faced when creating new Linked Data resources. The first case study describes the creation of linked RDF data from microarray data sets while the second discusses a linked RDF data set created from a knowledge base of drug therapies and drug targets. The third case study describes the creation of an RDF index of biomedical concepts present in unstructured clinical reports and how this index was linked to a drug side-effect knowledge base. The final case study describes the initial development of a linked data set from a knowledge base of small molecules.

This paper also provides a detailed set of recommended practices for creating and publishing Linked Data sources in the HCLS domain in such a way that they are discoverable and usable by people, software agents, and applications. These practices are based on the cumulative experience of the Linked Open Drug Data (LODD) task force of the HCLS IG. While no single set of recommendations can address all of the heterogeneous information needs that exist within the HCLS domains, practitioners wishing to create Linked Data should find the recommendations useful for identifying the tools, techniques, and practices employed by earlier developers. In addition to clarifying available methods for producing Linked Data, the recommendations for metadata should also make the discovery and consumption of Linked Data easier.

Introduction

Data integration is challenging because it requires sufficient domain expertise to understand the meaning of the data, which is often undocumented or implicit in human-readable labels. Linked Data is an approach to data integration that employs ontologies, terminologies, Uniform Resource Identifiers (URIs) and the Resource Description Framework (RDF) to connect pieces of data, information and knowledge on the Semantic Web [1]. RDF makes it possible to use terms and other resources from remote locations together with one’s own local data and terms. In effect, the ability to create assertions that mix local and remote namespaces makes it possible to publish and access knowledge distributed over the Web using common vocabularies. Expressing information as Linked Data shifts some of the integration burden from data consumers to publishers, which has the advantage that data publishers tend to be more knowledgeable about the intended semantics.
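
As a minimal sketch of what an assertion mixing local and remote namespaces might look like, the following Python snippet writes two triples in N-Triples syntax using only the standard library. The `LOCAL` and `REMOTE_GENE` namespaces, the `measuresGene` property, and the sample identifier are hypothetical placeholders chosen for illustration; only the RDFS namespace is a real vocabulary.

```python
# Hypothetical namespaces: LOCAL stands for a publisher's own namespace,
# REMOTE_GENE stands for a namespace maintained by some other data source.
LOCAL = "http://example.org/lab/"
REMOTE_GENE = "http://example.org/remote/gene/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"  # real W3C vocabulary


def uri(namespace, local_name):
    """Return an N-Triples URI term."""
    return f"<{namespace}{local_name}>"


def literal(text):
    """Return an N-Triples plain literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # Local subject described with a remote, shared vocabulary term.
    (uri(LOCAL, "sample42"), uri(RDFS, "label"), literal("Cortex sample 42")),
    # Local subject and local property pointing at a remote resource.
    (uri(LOCAL, "sample42"), uri(LOCAL, "measuresGene"), uri(REMOTE_GENE, "85365")),
]

print(to_ntriples(triples))
```

The second statement is the interesting one: its subject and predicate live in the publisher's namespace while its object is a URI owned by someone else, which is what lets independently published data sets be joined later.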

The benefits of Linked Data are particularly relevant in the life sciences, where there is often no agreement on a unique representation of a given entity. As a result, many life science entities are referred to by multiple labels, and some labels refer to multiple entities. For example, a query for Homo sapiens gene label “Alg2” in Entrez Gene (http://www.ncbi.nlm.nih.gov/gene) returns multiple results. Among them are one gene located on chromosome 5 (Entrez ID: 85365) and another on chromosome 9 (Entrez ID: 313231), each with multiple aliases. While a geneticist might refer to ‘Alg2’ without confusion among her lab colleagues, doing so in published data would present a barrier to future data integration because a correct interpretation would require the context in which these two genes are identified (e.g., the chromosome). If instead a Linked Data approach is taken to ensure that these two labels are semantically precise a priori (i.e., during data publication), then the burden of integration would be reduced.
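
The ambiguity described above can be illustrated with a small Python sketch. It assumes a hypothetical namespace (`GENE_NS`) and a hypothetical chromosome property (`ON_CHROMOSOME`); the Entrez IDs and chromosome numbers are the ones quoted in the paragraph. A lookup by label returns two candidates, while a lookup by URI is unambiguous.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
GENE_NS = "http://example.org/gene/"                   # hypothetical namespace
ON_CHROMOSOME = "http://example.org/vocab/chromosome"  # hypothetical property

# Each gene gets its own URI even though both carry the same label.
triples = [
    (GENE_NS + "85365", RDFS_LABEL, "Alg2"),
    (GENE_NS + "85365", ON_CHROMOSOME, "5"),
    (GENE_NS + "313231", RDFS_LABEL, "Alg2"),
    (GENE_NS + "313231", ON_CHROMOSOME, "9"),
]


def subjects_with_label(triples, label):
    """Return every subject URI that carries the given label."""
    return sorted({s for s, p, o in triples if p == RDFS_LABEL and o == label})


def describe(triples, subject):
    """Return the predicate/object pairs asserted about one subject URI."""
    return {p: o for s, p, o in triples if s == subject}


candidates = subjects_with_label(triples, "Alg2")
print(len(candidates))                                        # label alone is ambiguous
print(describe(triples, GENE_NS + "85365")[ON_CHROMOSOME])    # URI is precise
```

Publishing the URI rather than the bare label is what moves the disambiguation work from the consumer to the publisher.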

There are several motivations for publishing Linked Data sets, as indicated by the following potential use cases:

  • Shareability: A data provider or publisher would like to make some existing data more openly accessible, through standard, programmatic interfaces such as SPARQL or resolvable URIs. A scientist wants to provide early access to data (pre-publication) to her research network.

  • Integration: A developer desires to create and maintain a list of links between different RDF data sets so that she can easily query across these data sets.

  • Semantic normalization: A computer science researcher is interested in indexing an existing RDF data set using a set of common ontologies, so that the data set can be queried using ontological terms.

  • Discoverability: A bench biologist would like to be able to discover what is available in the Semantic Web related to a set of proteins, genes or chemical components, either as published results, raw data, or tissue libraries.

  • Federation: A pharmaceutical company desires to retrieve data from sources distributed across its enterprise using SPARQL.

Participants of the World Wide Web Consortium (W3C) Health Care and Life Sciences Interest Group (HCLS IG) have been making health care and life sciences data available as Linked Data for several years. In 2009, members of the interest group collectively published more than 8.4 million RDF triples for a range of genomic and drug-related data sets and made them available as Linked Data [2]. More recently, members have begun enriching the LODD data sets with data spanning discovery research and drug development [3]. The interest group has found that publishing HCLS data sets as Linked Data is particularly challenging due to (1) highly heterogeneous and distributed data sets; (2) difficulty in assessing the quality of the data; and (3) privacy concerns that force data publishers to de-identify portions of their data sets (e.g., from clinical research) [4]. Another challenge is to make it possible for the data consumer to discover, evaluate, and query the data. Would-be consumers of data from the Linked Open Data (LOD) cloud are confronted with these uncertainties and often resort to traditional data warehousing because of them.

This collective experience of publishing a wide range of HCLS data sets has led to the identification of a general data workflow for mapping HCLS data to RDF and linking it with other Linked Data sources (see Fig. 1). Briefly stated, the workflow includes the following steps for both structured and unstructured data sets:

  • 1. Select the data sources or portions thereof to be published as RDF

  • 2. Identify persistent URLs (PURLs) for concepts in existing ontologies and create a mapping from the structured data into an RDF view that can be used to transform data for SPARQL queries

  • 3. Customize the mapping manually if necessary

  • 4. Link concepts in the new RDF mapping to concepts in other RDF data sources relying as much as possible on URIs from ontologies

  • 5. Publish the RDF data through a SPARQL endpoint or as Linked Data

  • 6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [5] that enables SPARQL queries over the relational schema

  • 7. Create Semantic Web applications using the published data.
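
The structured-data steps of this workflow (select, map, link, publish) can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the tooling used in the case studies: the table, its rows, the `BASE`, `VOCAB`, and `REMOTE` namespaces, and the link table are all hypothetical, and the sketch materializes triples for simplicity rather than defining an RDF view with a mapping language.

```python
import sqlite3

BASE = "http://example.org/drug/"            # hypothetical local namespace
VOCAB = "http://example.org/vocab/"          # hypothetical local vocabulary
REMOTE = "http://example.org/remote/drug/"   # stand-in for another Linked Data source
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def load_source():
    """Step 1: select the data to publish (a toy in-memory relational table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE drug (id TEXT PRIMARY KEY, name TEXT, target TEXT)")
    conn.executemany(
        "INSERT INTO drug VALUES (?, ?, ?)",
        [("d1", "drug one", "target A"), ("d2", "drug two", "target B")],
    )
    return conn


def map_rows(conn):
    """Steps 2-3: map each row to triples; the column-to-property choice is manual."""
    triples = []
    query = "SELECT id, name, target FROM drug ORDER BY id"
    for drug_id, name, target in conn.execute(query):
        subject = f"<{BASE}{drug_id}>"
        triples.append((subject, f"<{RDFS_LABEL}>", f'"{name}"'))
        triples.append((subject, f"<{VOCAB}hasTarget>", f'"{target}"'))
    return triples


def link(triples, links):
    """Step 4: add owl:sameAs links from local URIs to URIs in another source."""
    linked = list(triples)
    for local_id, remote_id in links.items():
        linked.append(
            (f"<{BASE}{local_id}>", f"<{OWL_SAMEAS}>", f"<{REMOTE}{remote_id}>")
        )
    return linked


def serialize(triples):
    """Step 5: serialize as N-Triples, ready to load into a triple store."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


conn = load_source()
triples = link(map_rows(conn), {"d1": "r100"})
ntriples = serialize(triples)
print(ntriples)
```

In practice the hand-written `map_rows` function would be replaced by a declarative mapping, and the link table would come from a link discovery step rather than a hard-coded dictionary.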

HCLS Linked Data developers may face many challenges when creating new Linked Data resources using the above general workflow. As such, the identification of practices for addressing such challenges is a necessary step to enable integration of health care and life sciences data sets. The questions listed in Table 1 summarize many of these challenges. The purpose of this paper is to provide practices that address these questions.

Before presenting the recommendations, we present real world case studies intended to demonstrate both the data flow shown in Fig. 1 and how some of the questions in Table 1 have been addressed by HCLS IG participants. The first case study describes the creation of linked RDF data from microarray data sets while the second discusses a linked RDF data set created from a knowledge base of drug therapies and drug targets [6]. The third case study describes the creation of an RDF index of biomedical concepts present in unstructured clinical reports and how this index was linked to a drug side-effect knowledge base. The final case study describes the initial development of a linked data set from a knowledge base of small molecules [7].

Section snippets

Creating a representation of neurosciences microarray experiment results as RDF

Experimental biomedical results are often made available in semi-structured formats that can be used to assemble RDF representations. For example, microarray experiments typically include both the context in which the experiment was performed, including the anatomical location from which samples were collected, and the software used to extract and analyze the data as well as information regarding the list of genes that were deemed to be significant for the hypothesis under consideration.

As was

Emerging practices for handling issues that a Linked Data publisher may encounter

The variety of goals and issues presented in the four case studies suggests that no single set of rules would be able to address all of the heterogeneous information needs that exist within the HCLS domain. However, discussion within the HCLS IG has led to the following set of recommendations that address each of the 14 questions listed in Table 1.

Q1. What are the tools and approaches for mapping relational databases to RDF?

Relational databases (RDBs) are the mainstay of data management and a

Recommendations

We have proposed a set of practices that authors publishing HCLS data sets as Linked Data may find useful. Here, we highlight some of the most important points:

Create RDF views that anyone can use

  • Use a mapping language to create an RDF view of the data when possible, rather than data conversion and migration.

  • When possible, use vocabularies that are openly available from an authoritative server like that provided by OBO and the NCBO for HCLS data.

  • When faced with uncertainty about the proper term from

Conclusions

We have supplied four case studies of creating and publishing RDF for life sciences data sets and proposed recommended practices. Although our suggestions for the questions that may arise during Linked Data creation (Table 1) are oriented towards the HCLS domain, there is no reason why such practices could not be applied in other domains. For example, efforts are underway for a general source of ontologies and terminologies called the Open Ontology Repository [101] that would be much like

Acknowledgments

We thank the reviewers and several colleagues for their comments on the manuscript, especially Claus Stie Kallesøe and Lee Harland. We acknowledge Mikel Egaña Aranguren for contributing ideas that were integrated into this paper while working on a W3C IG Note with a similar theme. We thank the participants of the Linked Open Drug Data task force and the W3C Health Care and Life Sciences (HCLS) Interest Group. Support for HCLS activities was provided by the World Wide Web Consortium (W3C). RB was

References (102)

  • F. Belleau et al., Bio2RDF: towards a mashup to build bioinformatics knowledge systems, J. Biomed. Informatics (2008)
  • C. Bizer, DBpedia—a crystallization point for the web of data, Web Semantics: Science, Services and Agents on the World Wide Web (2009)
  • LinkedData, Linked Data—Connect Distributed Data across the Web, 2011. [Online] Available: http://linkeddata.org/...
  • A. Jentzsch, J. Zhao, O. Hassanzadeh, K.H. Cheung, M. Samwald, B. Andersson, Linking open drug data, in: Triplification...
  • J.S. Luciano, The translational medicine ontology: driving personalized medicine by bridging the gap from bedside to bench, J. Biomed. Semant. (2011)
  • H.F. Deus et al., S3QL: a distributed domain specific language for controlled semantic integration of life science data, BMC Bioinformatics (2011)
  • E. Prud’hommeaux, H. Deus, M.S. Marshall, Tutorial: query federation with SWObjects, in: Semantic Web Applications and...
  • C. Knox, DrugBank 3.0: a comprehensive resource for ‘Omics’ research on drugs, Nucleic Acids Res. (2010)
  • W.A. Warr, ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute outstation of the European Molecular Biology Laboratory (EMBL-EBI), J. Comput.-Aided Mol. Des. (2009)
  • H.F. Deus, et al. Provenance of microarray experiments for a better understanding of experiment results, in: ISWC 2010...
  • NCBO, NCBO BioPortal, 2012. [Online] Available: http://bioportal.bioontology.org/ [Accessed:...
  • NIFSTD, NIFSTD—Terms — NCBO BioPortal, 2011. [Online] Available:...
  • MAGE-TAB, MAGE-TAB model v1.1 prototype implementation, 2011. [Online] Available:...
  • DOID, DOID, 2011. [Online] Available: http://www.berkeleybop.org/ontologies/owl/DOID [Accessed:...
  • K. Alexander, R. Cyganiak, M. Hausenblas, J. Zhao, Describing Linked Datasets with the VoID Vocabulary, 2011. W3C...
  • O. Hartig et al., Publishing and Consuming Provenance Metadata on the Web of Linked Data, vol. 6378 (2010)
  • Bioconductor, Bioconductor—Home, 2011. [Online] Available: http://www.bioconductor.org/ [Accessed:...
  • D.S. Wishart, DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. (2006)
  • DailyMed, DailyMed: about DailyMed, 2012. [Online] Available: http://dailymed.nlm.nih.gov/ [Accessed:...
  • LinkedCT, About LinkedCT, 2011. [Online] Available: http://linkedct.org/about/ [Accessed:...
  • C. Bizer, D2R Map—Database to RDF mapping language,...
  • Drugbank, Drugbank SPARQL Endpoint, 2011. [Online] Available: http://www4.wiwiss.fu-berlin.de/drugbank/sparql...
  • DrugBank, DrugBank RDF dump, 2011. [Online] Available: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt.bz2...
  • S. Liu et al., RxNorm: prescription for electronic drug information exchange, IT Prof. (2005)
  • UPitt, University of Pittsburgh NLP repository, 2011. [Online] Available: http://www.dbmi.pitt.edu/nlpfront [Accessed:...
  • C. Jonquet et al., A system for ontology-based annotation of biomedical data, Med. Inf. (2008)
  • NCBO, NCBO virtual appliance—NCBO Wiki, 2011. [Online] Available:...
  • R. Boyce, Python Script to Convert a U of Pitt clinical note to linked-data RDF, 2011. [Online] Available:...
  • R. Boyce, SPARQL endpoint for the U of Pitt clinical notes linked semantic index, 02/2011, 2012. [Online] Available:...
  • BioPortal, BioPortal REST services—NCBO Wiki, 2012. [Online] Available:...
  • SIDER, SIDER LODD, 2011. [Online] Available:...
  • R. Isele, A. Jentzsch, C. Bizer, J. Volz, Silk—a link discovery framework for the Web of data, 2011. [Online]...
  • Banff, SourceForge.net: Banff manifesto—bio2rdf, 2011. [Online] Available:...
  • HCLSIG, HCLSIG Bio RDF subgroup/MinimalInformationAbout AGraph—W3C Wiki, 2011. [Online] Available:...
  • Python, python programming language—official website, 2012. [Online] Available: http://python.org/ [Accessed:...
  • RDFLib, RDFLib, 2012. [Online] Available: http://www.rdflib.net/ [Accessed:...
  • CC-SA, CC-SA unported, 2012. [Online] Available: http://creativecommons.org/licenses/by-sa/3.0/ [Accessed:...
  • Z. Beauvais, Featured dataset: ChEMBL-RDF, with Egon Willighagen, Kasabi Blog,...
  • M. Samwald et al., Integrating findings of traditional medicine with modern pharmaceutical research: the potential role of linked open data, Chin. Med. (2010)
  • E. Willighagen, Linking the resource description framework to cheminformatics and proteochemometrics, J. Biomed. Semant. (2011)
  • ChEMBL, ChEMBL FTP directory, 2012. [Online] Available:...
  • E.L. Willighagen, chembl.rdf, 2012. [Online] Available: https://github.com/egonw/chembl.rdf [Accessed:...
  • E.L. Willighagen, chem-bla-ics, 2012. [Online] Available: http://chem-bla-ics.blogspot.com/ [Accessed:...
  • J. Hastings et al., The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web, PLoS ONE (2011)
  • H. Stuckenschmidt, Exploring large document repositories with RDF technology: the dope project, IEEE Intell. Syst. (2004)
  • E.L. Willighagen, ChEMBL SPARQL endpoint, 2012. [Online] Available: http://rdf.farmbio.uu.se/chembl/sparql [Accessed:...
  • E.L. Willighagen, ChEMBL Snorql endpoint, 2012. [Online] Available: http://rdf.farmbio.uu.se/chembl/snorql/ [Accessed:...
  • SNORQL, SNORQL—GitHub, 2012. [Online] Available: https://github.com/kurtjx/SNORQL [Accessed:...
  • E.L. Willighagen, ChEMBL-RDF on Kasabi, 2012. [Online] Available: http://beta.kasabi.com/dataset/chembl-rdf [Accessed:...
  • S.S. Sahoo, et al. A survey of current approaches for mapping of relational databases to RDF, w3org,...