Research paper
Automatic knowledge graph population with model-complete text comprehension for pre-clinical outcomes in the field of spinal cord injury

https://doi.org/10.1016/j.artmed.2023.102491

Highlights

  • Model-complete text comprehension to extract complex nested structured knowledge.

  • Evidence aggregation of preclinical therapies to support the translation into clinic.

  • Automated knowledge graph population from preclinical spinal cord injury studies.

  • Comprehensive system evaluation for extracting information from natural language.

  • Openly available data set, source code and constructed domain knowledge graph.

Abstract

The paradigm of evidence-based medicine requires that medical decisions are made on the basis of the best available knowledge published in the literature. Existing evidence is often summarized in the form of systematic reviews and/or meta-reviews and is rarely available in a structured form. Manual compilation and aggregation is costly, and conducting a systematic review requires considerable effort. The need to aggregate evidence arises not only in the context of clinical trials, but also in the context of pre-clinical animal studies. In this context, evidence extraction is important to support translation of the most promising pre-clinical therapies into clinical trials or to optimize clinical trial design. With the aim of developing methods that facilitate the aggregation of evidence published in pre-clinical studies, we present a new system that automatically extracts structured knowledge from such publications and stores it in a so-called domain knowledge graph. The approach follows the paradigm of model-complete text comprehension: guided by a domain ontology, it creates a deep relational data structure that reflects the main concepts, protocol, and key findings of studies. Focusing on the domain of spinal cord injuries, a single outcome of a pre-clinical study is described by up to 103 outcome parameters. Since the problem of extracting all these variables together is intractable, we propose a hierarchical architecture that incrementally predicts semantic sub-structures according to a given data model in a bottom-up fashion. At the heart of our approach is a statistical inference method that relies on conditional random fields to infer the most likely instance of the domain model given the text of a scientific publication as input. This approach allows modeling dependencies between the different variables describing a study in a semi-joint fashion.
We present a comprehensive evaluation of our system to understand the extent to which our system can capture a study in the depth required to enable the generation of new knowledge. We conclude the article with a brief description of some applications of the populated knowledge graph and show the potential implications of our work for supporting evidence-based medicine.

Introduction

The principle of evidence-based medicine [1] requires that medical decisions are made on the basis of the best available evidence published in the literature. Existing evidence is often summarized in the form of systematic reviews and/or meta-reviews and is rarely available in a structured form. The manual compilation and aggregation of evidence is expensive, and conducting a systematic review requires considerable effort, so whole organizations and communities are dedicated to this task, e.g. the Cochrane Foundation. The need to aggregate evidence arises not only in the context of clinical trials, but also in the pre-clinical context, which largely comprises animal studies. Here, evidence is important to support the translation of pre-clinical findings into clinical practice, but also to inform the design of new therapies, particularly for areas in which no effective treatments exist [2]. With the goal of facilitating the task of aggregating evidence published in the form of (pre-)clinical studies, we present a new methodology that automatically extracts structured knowledge from such publications. Our goal is to populate a domain knowledge graph with a sufficient level of detail to support the exploration and automatic aggregation of existing evidence. A knowledge graph stores information as nodes (key informational units) and edges (relations between these nodes) to represent arbitrarily complex knowledge structures [3]. Knowledge graphs are not only used by large companies like Google [4], Facebook [5], or Amazon [6], but are also widely applied in scientific projects [7]. The main obstacle to the (automatic) construction of a domain knowledge graph for (pre-)clinical evidence is that existing knowledge is typically described in articles written in natural language and is not available in structured form.
While there are a few registries that aim to collect structured knowledge, in most cases they cover only the protocol, not the actual study results required in evidence-based medicine. The main contribution of our paper is a comprehensive and deep domain-adapted Information Extraction (IE) system that extracts the full details of a study with respect to a data model described by a domain ontology. We show that this is a complex information extraction task that has so far not been considered in its full complexity in the literature. While there are many approaches that extract the PICO (Population, Intervention, Comparator, Outcome) elements without further analysis [8], [9] and approaches that extract only single aspects of a study [10], we propose the first system that extracts the outcomes of a study in full detail as needed to support (i) exploration of the available evidence, (ii) answering competency questions, and (iii) grading of therapies.
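The node-and-edge representation described above, and the kind of competency question it is meant to support, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `KnowledgeGraph` class, the identifiers such as `study:1` and the relation names such as `hasResult` are hypothetical and are not taken from the paper's data model or from SCIO.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph: nodes are strings, edges are labelled and directed."""

    def __init__(self):
        # Maps a subject node to its outgoing (relation, object) edges.
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        """Add one edge subject --relation--> obj."""
        self._out[subject].append((relation, obj))

    def objects(self, subject, relation):
        """All nodes reachable from `subject` via `relation`."""
        return [o for r, o in self._out.get(subject, []) if r == relation]

    def subjects(self, relation, obj):
        """All nodes that point to `obj` via `relation`, in sorted order."""
        return sorted(s for s, edges in self._out.items() if (relation, obj) in edges)


# Hypothetical content: two studies reporting results for the same treatment.
kg = KnowledgeGraph()
kg.add("study:1", "hasResult", "result:1")
kg.add("result:1", "hasTreatment", "treatment:A")
kg.add("result:1", "hasTrend", "improvement")
kg.add("study:2", "hasResult", "result:2")
kg.add("result:2", "hasTreatment", "treatment:A")
kg.add("result:2", "hasTrend", "no_change")

# Toy competency question: which trends were reported for treatment A?
results_for_a = kg.subjects("hasTreatment", "treatment:A")
trends = [kg.objects(r, "hasTrend")[0] for r in results_for_a]
print(trends)
```

Once results are stored as nodes of their own, a question that spans several publications reduces to a traversal over edges, which is what makes on-demand aggregation possible.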

This work has been developed as part of the PSINK project (Automatically Populating a Preclinical Spinal Cord Injury Knowledge Base to Support Clinical Translation), which is concerned with constructing a knowledge graph for pre-clinical studies in the domain of Spinal Cord Injury (SCI). SCIs denote accidentally or experimentally inflicted damage to the spine, which often leaves those affected partially or completely paralyzed, impairing, inter alia, their walking ability, sensation and autonomic function. The annual incidence of spinal cord injury lies between 4 and 9 cases per 100,000 people, and the prevalence is 1:1000 [11]. The prevalence is high relative to the incidence because injured patients survive the initial impact thanks to advanced medical care in developed countries and live with impairments for many decades. This also results in high costs for the healthcare system as a whole. Despite the growing interest and the increasing volume of published research, summing up to 75,475 peer-reviewed PubMed-listed publications in 2020, no controlled clinical study has so far demonstrated therapy success in a reproducible manner, apart from some single case reports [2].

Motivated by this, our main goal is to automatically extract results from the pre-clinical literature to support the systematic and automatic aggregation of evidence in SCI. The knowledge contained in a publication can be described by a complex structure made up of many variables that together describe the study design, protocol and results. To model this in a comprehensive and machine-readable way, we build on the Spinal Cord Injury Ontology (SCIO) [12], which describes all important key aspects of a pre-clinical SCI study at a level of detail sufficient to support the aggregation of evidence. SCIO was developed bottom-up based on an extensive literature review, resulting in a well-defined structure consisting of more than 670 classes and 100 relations to describe a study in terms of all relevant information required for the purpose of evidence aggregation.

The methodology presented in this paper follows the Model-complete Text Comprehension (MCTC) [13] paradigm, which uses a given ontology-based data model to endow a system with knowledge to systematically search for all relevant information in a text. The main advantage is that the information extraction system “understands” what knowledge is required and can infer information even though it is not explicitly mentioned. This is in contrast to approaches that are bottom-up and not guided by an underlying schema, which are often limited to extracting (binary) relationships between so-called named entities. However, in complex domains such as we consider here, there are many unnamed entities that are of utmost relevance and could never be recognized by methods that rely solely on named entity recognition (NER). The prime example of unnamed entities are the experimental results of a study, which are typically not named in the sense that each result is given a name. Nevertheless, it is crucial to extract all main results and their parameters for the purpose of evidence aggregation.
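The idea that a data model tells the system what it still has to look for can be sketched as a nested template with slots. This is only an illustration under stated assumptions: the classes `Treatment`, `ExperimentalGroup` and `Result`, their slot names and the helper `missing_slots` are hypothetical and do not reproduce SCIO classes or the authors' implementation.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Optional


# Hypothetical, heavily simplified data model (not the actual SCIO classes).
@dataclass
class Treatment:
    compound: Optional[str] = None
    dosage: Optional[str] = None


@dataclass
class ExperimentalGroup:
    organism: Optional[str] = None
    injury_model: Optional[str] = None
    treatment: Optional[Treatment] = None


@dataclass
class Result:
    # A result has no name of its own; it exists only through its slots.
    target_group: Optional[ExperimentalGroup] = None
    reference_group: Optional[ExperimentalGroup] = None
    trend: Optional[str] = None


def missing_slots(instance, prefix=""):
    """Return dotted paths of all slots that are still unfilled."""
    missing = []
    for f in fields(instance):
        value = getattr(instance, f.name)
        path = prefix + f.name
        if value is None:
            missing.append(path)
        elif is_dataclass(value):
            # Descend into nested structures defined by the data model.
            missing.extend(missing_slots(value, path + "."))
    return missing


# A partially extracted result: the schema exposes what remains to be found.
result = Result(
    target_group=ExperimentalGroup(
        organism="rat", treatment=Treatment(compound="compound X")
    ),
    trend="improvement",
)
open_slots = missing_slots(result)
print(open_slots)
```

The `Result` instance illustrates an unnamed entity: no text span names it, yet the schema requires it to exist and lists which of its slots the extraction still has to fill.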

The basic methodology of ter Horst and Cimiano [13] focuses on the problem of detecting the correct cardinality for entity classes, and presents and compares a number of approximate inference strategies that treat the prediction of the various variables as a joint inference problem. In this paper, we extend this methodology to extract all the relevant aspects as defined by SCIO and provide a comprehensive evaluation that clearly shows the potential and limitations of the system.

The benefit of constructing a knowledge graph automatically is that it supports on-demand aggregation of pre-clinical results. We show that the populated knowledge graph supports domain knowledge exploration via a convenient tool, which we call the SCIExplorer [14]. Furthermore, we show how key competency questions can be answered and that our knowledge graph supports automatic grading of therapies, an effort that is usually done manually [15]. Consequently, our method offers the advantage that the experimental design of a planned study can be optimized by developing new hypotheses through a systematic overview of the experiments carried out so far, and redundant studies can be avoided. While our method has been developed and tested on pre-clinical publications, it can equally be applied to clinical trials or other therapeutic areas. This requires replacing the underlying ontology with one that contains structured knowledge of the desired domain (e.g. C-TrO [16] for clinical studies) and annotating sufficient training data. The main contributions of the paper can be summarized as follows:

  • We present an information extraction approach rooted in the model-complete text comprehension paradigm [13] to automatically extract the key evidence from a pre-clinical spinal cord injury study. We tackle in particular the challenge of extracting complex relational and nested structures as required to populate a deep domain knowledge graph.

  • We motivate the need for the complex structure predicted by our system with a real-world example and show the challenges involved in performing information extraction from published pre-clinical studies.

  • We present a comprehensive evaluation of our method on all aspects of a pre-clinical study that are relevant for the aggregation of evidence across publications, as described by the spinal cord injury ontology that we developed in previous work [12].

  • We make the developed data set, source code and constructed knowledge graph openly available so that other researchers can work with our data.

  • We briefly demonstrate a few key applications that are supported by our domain knowledge graph.

The remainder of this paper is structured as follows. In the next section, we (i) provide information about the domain ontology, SCIO, and the derived data model used for MCTC to describe the protocol, structure and results of a pre-clinical study, (ii) describe our effort to construct an annotated corpus on which our information extraction system is trained, and (iii) provide a motivating real-world example that shows the inherent complexity of extracting the key aspects of a study. In Section 3, we describe our methodology for automatically extracting information from scientific articles written in natural language. In Section 4, we describe the evaluation of the system and the results for the key parameters of a study. A deeper analysis and further information regarding the evaluation are provided in Appendix B. In Section 5, we provide an informative error analysis that shows the limitations of the system and can inspire future work on the task. Finally, in Section 5.2, we present a few key applications that are supported by the constructed knowledge graph.

Section snippets

Preliminaries

In this section, we describe the data model derived from our ontology that builds the backbone of our system in the sense that it defines the structure of the underlying knowledge graph and guides the MCTC-based IE system. We introduce a motivating real world example that demonstrates the complexity and challenges involved in capturing relevant aspects of a pre-clinical study needed for proper evidence aggregation. Finally, we present the annotated data set we use to train and evaluate our

Method

The information extraction task we address can be regarded as a complex structured prediction problem that consists of the prediction of nested structures describing the protocol and results of a pre-clinical study as shown in Fig. 8. The number of variables to be predicted is not fixed a priori, as it depends on the particular setting of the pre-clinical study as well as the number of experimental groups and results involved, and needs to be determined during inference. In our data set, the

Experimental evaluation

In this section, we describe the evaluation of our system, providing results and error analyses for each of the classes described in Section 2.2. The system’s performance reflects the results in a real-world application scenario, which includes the prediction throughout all levels of the system’s hierarchy, starting with the entity recognition and ending at the prediction of complete results. We denote this setting as joint as we consider entity recognition and relation extraction within the

Discussion

We have presented an approach that decomposes the task of extracting the design and results of a pre-clinical study in the field of spinal cord injury into a hierarchy of components. We have presented results of a thorough evaluation on 11 main classes of the spinal cord injury ontology. Our work shows that the faithful extraction and representation of the design and results of pre-clinical studies is a challenging problem. On average, the description of a single study requires 1,132 variables

Conclusion

We have presented a domain model-based information extraction system that can extract information from pre-clinical studies written in natural language at a detailed level as a basis to support aggregation of evidence in order to populate a domain knowledge graph with the evidence extracted. The system is guided by an ontology that defines the classes and their properties to be extracted. The system design relies on a complex processing architecture that as core component relies on a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors gratefully acknowledge Tarek Kirchhoffer for preparing a preliminary version of the “Annotation Guidelines” for SCIO.

This work has been funded by the Federal Ministry of Education and Research (BMBF, Germany) in the PSINK project (project numbers 031L0028A and 031L0028B).

References (64)

  • Curtiss M, et al. Unicorn: A system for searching the social graph. Proc VLDB Endow (2013)
  • Dong XL. Challenges and innovations in building a product knowledge graph. In: Proceedings of the 24th ACM SIGKDD...
  • Auer S, Kovtun V, Prinz M, Kasprzik A, Stocker M, Vidal ME. Towards a knowledge graph for science. In: Proceedings of...
  • Summerscales R, et al. Identifying treatments, groups, and outcomes in medical abstracts
  • De Bruijn B, et al. Automated information extraction of key trial design elements from clinical trial publications
  • Trenta A, et al. Extraction of evidence tables from abstracts of randomized clinical trials using a maximum entropy classifier and global constraints (2015)
  • Brazda N, ter Horst H, Hartung M, Wiljes C, Estrada V, Klinger R, et al. SCIO: An ontology to support the formalization...
  • ter Horst H, Cimiano P. Structured Prediction for Joint Class Cardinality and Entity Property Inference in...
  • Borowi A, et al. Ontology-driven visual exploration of preclinical research data in the spinal cord injury domain
  • Kwon BK, et al. A grading system to evaluate objectively the strength of pre-clinical data of acute neuroprotective therapies for clinical translation in spinal cord injury. J Neurotrauma (2011)
  • Sanchez-Graillet O, Cimiano P, Witte C, Ell B. C-TrO: An Ontology for Summarization and Aggregation of the Level of...
  • McGuinness DL, et al. OWL web ontology language overview. W3C Recomm (2004)
  • Stearns MQ, et al. SNOMED clinical terms: Overview of the development process and project status
  • McCray AT. An upper-level ontology for the biomedical domain. Comp Funct Genomics (2003)
  • Lipscomb CE. Medical subject headings (MeSH). Bull Med Library Assoc (2000)
  • Huang X, et al. Evaluation of PICO as a knowledge representation for clinical questions
  • Brickley D, et al. Resource description framework (RDF) schema specification (1999)
  • Hartung M, ter Horst H, Grimm F, Diekmann T, Klinger R, Cimiano P. SANTO: A web-based annotation tool for...
  • Landis JR, et al. The measurement of observer agreement for categorical data. Biometrics (1977)
  • Luo G, Huang X, Lin C-Y, Nie Z. Joint entity recognition and disambiguation. In: Proc. of the 2015 conference on...
  • Singh S, Riedel S, Martin B, Zheng J, McCallum A. Joint inference of entities, relations, and coreference. In:...
  • Hajishirzi H, Zilles L, Weld DS, Zettlemoyer L. Joint coreference resolution and named-entity linking with multi-pass...