Research paper
Automatic knowledge graph population with model-complete text comprehension for pre-clinical outcomes in the field of spinal cord injury
Introduction
The principle of evidence-based medicine [1] requires that medical decisions be made on the basis of the best available evidence published in the literature. Existing evidence is often summarized in the form of systematic reviews and/or meta-reviews and is rarely available in a structured form. The manual compilation and aggregation of evidence is expensive, and conducting a systematic review requires considerable effort, so whole organizations and communities are dedicated to this task, e.g. the Cochrane Foundation. The need to aggregate evidence does not only arise in the context of clinical trials, but is also important in the pre-clinical context, which largely consists of animal studies. Here, evidence is important to support the translation of pre-clinical findings into clinical practice, but also to inform the design of new therapies, particularly for areas in which no effective treatments exist [2]. With the goal of facilitating the task of aggregating evidence published in the form of (pre-)clinical studies, we present a new methodology that automatically extracts structured knowledge from such publications. Our goal is to populate a domain knowledge graph with a sufficient level of detail to support the exploration and automatic aggregation of existing evidence. A knowledge graph stores information as nodes (key informational units) and edges (relations between these nodes) to represent arbitrarily complex knowledge structures [3]. Knowledge graphs are not only used by large companies like Google [4], Facebook [5], or Amazon [6], but are also widely applied in scientific projects [7]. The main obstacle to the (automatic) construction of a domain knowledge graph for (pre-)clinical evidence lies in the fact that existing knowledge is typically described in articles written in natural language and is not available in structured form.
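To make the node-and-edge representation concrete, the following minimal Python sketch stores a knowledge graph as a set of labelled edges between nodes. It is an illustration only: all identifiers (`study_1`, `hasExperimentalGroup`, `hasTreatment`, and so on) are hypothetical and are not taken from the ontology or the system described in this paper.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory knowledge graph: nodes connected by labelled edges."""

    def __init__(self):
        # subject node -> set of (edge label, object node) pairs
        self._out = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._out[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return all nodes reachable from `subject` via `predicate`, sorted."""
        return sorted(o for p, o in self._out[subject] if p == predicate)


def build_example_graph():
    # Hypothetical example data: one study, two groups, one result.
    kg = TripleStore()
    kg.add("study_1", "hasExperimentalGroup", "group_A")
    kg.add("study_1", "hasExperimentalGroup", "group_B")
    kg.add("group_A", "hasTreatment", "treatment_X")
    kg.add("group_B", "hasTreatment", "vehicle_control")
    kg.add("study_1", "hasResult", "result_1")
    kg.add("result_1", "hasTargetGroup", "group_A")
    kg.add("result_1", "hasReferenceGroup", "group_B")
    return kg


def treatments_in_study(kg, study):
    """Follow two edges: study -> experimental groups -> treatments."""
    return sorted(
        t
        for g in kg.objects(study, "hasExperimentalGroup")
        for t in kg.objects(g, "hasTreatment")
    )
```

Even in this toy form, a result is a node of its own that links a target group to a reference group, which hints at why aggregating evidence requires more than a flat list of extracted terms.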
While there are a few registries that aim to collect structured knowledge, in most cases they cover only the study protocol and not the actual study results required in evidence-based medicine. The main contribution of our paper is to present a comprehensive and deep domain-adapted Information Extraction (IE) system that extracts the full details of a study with respect to a data model described by a domain ontology. We show that this is a complex information extraction task that so far has not been considered in its full complexity in the literature. While there are many approaches that extract the PICO (Population, Intervention, Comparator, Outcome) elements without further analysis [8], [9] and approaches that extract only single aspects of a study [10], we propose the first system that extracts the outcomes of a study in full detail as needed to support (i) exploration of the available evidence, (ii) answering competency questions, and (iii) grading of therapies.
This work has been developed as part of the PSINK project (Automatically Populating a Preclinical Spinal Cord Injury Knowledge Base to Support Clinical Translation), which is concerned with constructing a knowledge graph for pre-clinical studies in the domain of Spinal Cord Injury (SCI). SCI refers to accidentally or experimentally inflicted damage to the spine, which often leaves those affected partially or completely paralyzed, impairing, inter alia, their walking ability, sensation and autonomic function. The annual incidence of spinal cord injury lies between 4 and 9 cases per 100,000 people, while the prevalence is 1:1000 [11]. The prevalence is comparatively high because injured patients survive the initial impact due to advanced medical care in developed countries and live impaired for many decades. This also results in high costs for the healthcare system as a whole. Despite growing interest and an increasing volume of research, amounting to 75,475 peer-reviewed PubMed-listed publications in 2020, no controlled clinical study has so far demonstrated therapy success in a reproducible manner, apart from some single case reports [2].
Motivated by this, our main goal is to automatically extract results from the pre-clinical literature to support the systematic and automatic aggregation of evidence in SCI. The knowledge contained in a publication can be described by a complex structure made up of many variables that together describe the study design, protocol and results. To model this in a comprehensive and machine-readable way, we build on the Spinal Cord Injury Ontology (SCIO) [12], which describes all important key aspects of a pre-clinical SCI study at a level of detail sufficient to support the aggregation of evidence. SCIO was developed bottom-up based on an extensive literature review, resulting in a well-defined structure consisting of more than 670 classes and 100 relations to describe a study in terms of all relevant information required for the purpose of evidence aggregation.
The methodology presented in this paper follows the Model-complete Text Comprehension (MCTC) [13] paradigm, which uses a given ontology-based data model to endow a system with knowledge to systematically search for all relevant information in a text. The main advantage is that the information extraction system “understands” what knowledge is required and can infer information even when it is not explicitly mentioned. This is in contrast to bottom-up approaches that are not guided by an underlying schema and are often limited to extracting (binary) relationships between so-called named entities. However, in complex domains such as the one we consider here, there are many unnamed entities that are of utmost relevance and could never be recognized by methods that rely solely on named entity recognition (NER). Prime examples of unnamed entities are the experimental results of a study, which are typically not named in the sense that each result is given a name. Nevertheless, it is crucial to extract all main results and their parameters for the purpose of evidence aggregation.
The basic methodology of ter Horst and Cimiano [13] focuses on the problem of detecting the correct cardinality for entity classes, and presents and compares a number of approximate inference strategies for predicting the various variables as a joint inference problem. In this paper, we extend this methodology to extract all the relevant aspects as defined by SCIO and provide a comprehensive evaluation that clearly shows the potential and limitations of the system.
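The idea that a data model tells the system what to look for can be illustrated with a small Python sketch. It is not the implementation described in this paper and performs no inference: it only shows how a schema makes unfilled slots explicit, including for unnamed entities such as results. The class names, slot names and the `DATA_MODEL` dictionary are hypothetical simplifications and are not taken from SCIO.

```python
from dataclasses import dataclass, field

# Hypothetical, heavily simplified data model: class name -> required slots.
DATA_MODEL = {
    "Result": ["target_group", "reference_group", "investigation_method", "trend"],
    "ExperimentalGroup": ["organism_model", "injury", "treatment"],
}


@dataclass
class Template:
    """One instance of a data-model class whose slots are filled step by step."""

    cls: str
    slots: dict = field(default_factory=dict)

    def missing_slots(self):
        # Slots required by the data model that have no value yet.
        return [s for s in DATA_MODEL[self.cls] if s not in self.slots]

    def is_complete(self):
        return not self.missing_slots()


def open_questions(templates):
    """List (template index, class, slot) for every slot still unfilled."""
    return [
        (i, t.cls, s)
        for i, t in enumerate(templates)
        for s in t.missing_slots()
    ]
```

A schema-free extractor has no notion of a missing slot. Here, by contrast, a `Result` template that lacks a `reference_group` is visibly incomplete, which is the kind of signal a model-guided system can use to keep searching the text. How many templates of each class to instantiate is the cardinality problem mentioned above and is deliberately left out of this sketch.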
The benefit of constructing a knowledge graph automatically is that it supports on-demand aggregation of pre-clinical results. We show that the populated knowledge graph supports domain knowledge exploration via a convenient tool, which we call the SCIExplorer [14]. Furthermore, we show how key competency questions can be answered and that our knowledge graph supports automatic grading of therapies, an effort that is usually done manually [15]. Consequently, our method offers the advantage that the experimental design of a planned study can be optimized by developing new hypotheses through a systematic overview of the experiments carried out so far, and redundant studies can be avoided. While our method has been developed and tested on pre-clinical publications, it can equally be applied to clinical trials or other therapeutic areas. This requires replacing the underlying ontology with one that contains structured knowledge of the desired domain (e.g. C-TrO [16] for clinical studies) and annotating sufficient training data. The main contributions of the paper can be summarized as follows:
- We present an information extraction approach rooted in the model-complete text comprehension paradigm [13] to automatically extract the key evidence from a pre-clinical spinal cord injury study. We tackle in particular the challenge of extracting complex relational and nested structures as required to populate a deep domain knowledge graph.
- We motivate the necessity for such a complex structure as predicted by our system by a real world example and show the challenges involved in performing information extraction from published pre-clinical studies.
- We present a comprehensive evaluation of our method on all aspects of a pre-clinical study that are relevant for the aggregation of the evidence across publications as described along the spinal cord injury ontology that we have developed in previous work [12].
- We make the developed data set, source code and constructed knowledge graph openly available so that other researchers can work with our data.
- We briefly demonstrate a few key applications that are supported by our domain knowledge graph.
The remainder of this paper is structured as follows. In the next section, we (i) provide information about the domain ontology, SCIO, and the derived data model used for MCTC to describe the protocol, structure and results of a pre-clinical study, (ii) describe our effort to construct an annotated corpus on which our information extraction system is trained, and (iii) provide a motivating real-world example that shows the inherent complexity of extracting the key aspects of a study. In Section 3, we describe our methodology for automatically extracting information from scientific articles written in natural language. In Section 4, we describe the evaluation of the system and the results for the key parameters of a study. A deeper analysis and further information regarding the evaluation are provided in Appendix B. In Section 5, we provide an informative error analysis that shows the limitations of the system and can inspire future work on the task. Finally, in Section 5.2, we present a few key applications that are supported by the constructed knowledge graph.
Section snippets
Preliminaries
In this section, we describe the data model derived from our ontology that builds the backbone of our system in the sense that it defines the structure of the underlying knowledge graph and guides the MCTC-based IE system. We introduce a motivating real world example that demonstrates the complexity and challenges involved in capturing relevant aspects of a pre-clinical study needed for proper evidence aggregation. Finally, we present the annotated data set we use to train and evaluate our
Method
The information extraction task we address can be regarded as a complex structured prediction problem that consists of predicting nested structures describing the protocol and results of a pre-clinical study, as shown in Fig. 8. The number of variables to be predicted is not fixed a priori: it depends on the particular setting of the pre-clinical study as well as on the number of experimental groups and results involved, among other factors, and needs to be determined during inference. In our data set, the
Experimental evaluation
In this section, we describe the evaluation of our system, providing results and error analyses for each of the classes described in Section 2.2. The system’s performance reflects the results in a real-world application scenario, which includes prediction throughout all levels of the system’s hierarchy, starting with entity recognition and ending with the prediction of complete results. We denote this setting as joint as we consider entity recognition and relation extraction within the
Discussion
We have presented an approach that decomposes the task of extracting the design and results of a pre-clinical study in the field of spinal cord injury into a hierarchy of components. We have presented results of a thorough evaluation on 11 main classes of the spinal cord injury ontology. Our work shows that the faithful extraction and representation of the design and results of pre-clinical studies is a challenging problem. On average, the description of a single study requires 1,132 variables
Conclusion
We have presented a domain model-based information extraction system that can extract information from pre-clinical studies written in natural language at a detailed level as a basis to support aggregation of evidence in order to populate a domain knowledge graph with the evidence extracted. The system is guided by an ontology that defines the classes and their properties to be extracted. The system design relies on a complex processing architecture that as core component relies on a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors gratefully acknowledge Tarek Kirchhoffer for preparing a preliminary version of the “Annotation Guidelines” for SCIO.
This work has been funded by the Federal Ministry of Education and Research (BMBF, Germany) in the PSINK project (project number 031L0028A & 031L0028B).
References (64)
- Spinal cord lesions
- et al. Co-transplantation of olfactory ensheathing glia and mesenchymal stromal cells does not have synergistic effects after spinal cord injury in the rat. Cytotherapy (2010)
- et al. Openie-based approach for knowledge graph construction from text. Expert Syst Appl (2018)
- et al. ATOLL—A framework for the automatic induction of ontology lexica. Data Knowl Eng (2014)
- et al. Clinical information extraction applications: A literature review. J Biomed Inform (2018)
- et al. Clinical concept extraction: A methodology review. J Biomed Inform (2020)
- et al. Evidence based medicine: What it is and what it isn’t. BMJ (1996)
- et al. Functional regeneration of supraspinal connections in a patient with transected spinal cord following transplantation of bulbar olfactory ensheathing cells with peripheral nerve bridging. Cell Transpl (2014)
- Knowledge graph refinement: A survey of approaches and evaluation methods. Semant Web (2017)
- Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, et al. Knowledge vault: A web-scale approach to probabilistic...
- Unicorn: A system for searching the social graph. Proc VLDB Endow
- Identifying treatments, groups, and outcomes in medical abstracts
- Automated information extraction of key trial design elements from clinical trial publications
- Extraction of evidence tables from abstracts of randomized clinical trials using a maximum entropy classifier and global constraints
- Ontology-driven visual exploration of preclinical research data in the spinal cord injury domain
- A grading system to evaluate objectively the strength of pre-clinical data of acute neuroprotective therapies for clinical translation in spinal cord injury. J Neurotrauma
- OWL web ontology language overview. W3C Recomm
- SNOMED clinical terms: Overview of the development process and project status.
- An upper-level ontology for the biomedical domain. Comp Funct Genomics
- Medical subject headings (MeSH). Bull Med Library Assoc
- Evaluation of PICO as a knowledge representation for clinical questions
- Resource description framework (RDF) schema specification
- The measurement of observer agreement for categorical data. Biometrics