Real-world data medical knowledge graph: construction and applications

https://doi.org/10.1016/j.artmed.2020.101817

Highlights

  • This paper proposes a systematic approach to constructing a medical KG from EMRs.

  • The constructed KG contains 9 entity types, with a total of 22,508 entities and 579,094 quadruplets.

  • A quadruplet structure is proposed to represent a fact in the real-world KG.

  • The proposed ranking function PSR achieves the best performance under all relations.

  • The obtained KG is successfully applied to several practical applications.

Abstract

Objective

Medical knowledge graphs (KGs) are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.

Materials and Methods

The original data set contains 16,217,270 de-identified clinical visit records from 3,767,198 patients. The KG construction procedure comprises 8 steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge in place of the classical KG triplet, and a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.

Results

A medical KG with 9 entity types, including disease, symptom, etc., was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF-IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG@10) from 0.799 to 0.906. Embedding representations were learned for all entities and relations and were shown to be effective through disease clustering.
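For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@k can be computed for one ranked list. It assumes the common linear-gain form with a log2 position discount; the source does not state which gain variant was used, and the function names and relevance grades here are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant item at all: define NDCG as 0
    return dcg_at_k(relevances, k) / ideal


# Hypothetical relevance grades of related entities, in the order a ranker returned them.
ranked_grades = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked_grades, k=10)
```

A perfectly ordered list scores 1.0, and any misordering lowers the score, which is why the metric suits comparing ranking functions against graded expert labels.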

Conclusion

The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vector as the entity's semantic representation. Moreover, the obtained KG has found many successful applications owing to its statistics-based quadruplets.

where Ncomin is a minimum co-occurrence number and R is the basic reliability value. The reliability value measures how reliable the relationship between Si and Oij is. The rationale for this definition is that the higher the value of Nco(Si, Oij), the more reliable the relationship. However, the reliability values of two relationships should not differ greatly when both of their co-occurrence numbers are very large. In our study, we set Ncomin = 10 and R = 1 after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.
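The defining equation does not appear alongside this paragraph. Purely as an illustration, the sketch below uses one hypothetical form, R + log10(max(Nco - Ncomin, 0) + 1), chosen only because it reproduces the three example values quoted above with Ncomin = 10 and R = 1. The formula and the function name are assumptions, not the authors' stated definition.

```python
import math


def reliability(n_co: int, n_co_min: int = 10, base: float = 1.0) -> float:
    """Hypothetical reliability value (an assumed form, not taken from the source).

    Pairs at or below the minimum co-occurrence count receive the base value R;
    above it the value grows only logarithmically, so two very frequent pairs
    end up with similar reliability.
    """
    return base + math.log10(max(n_co - n_co_min, 0) + 1)


for n in (1, 100, 10000):
    print(n, round(reliability(n), 2))
```

The logarithm captures the stated intent: more co-occurrences mean a more reliable relationship, but the gap between two very large counts stays small.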

Introduction

Knowledge graphs (KGs) have received a lot of attention in recent years. In 2012, Google applied the KG to its search engine; since then, KGs have been used in many application fields [1]. In the medical domain, the knowledge graph is a fundamental component of artificial intelligence (AI)-aided medical systems, such as clinical decision support systems (CDSSs) for diagnosis and treatment [[2], [3], [4], [5]] and self-diagnosis utilities that help patients evaluate their health condition based on symptoms [6,7].

The KG is a graph-based knowledge representation and organization method that uses a set of subject-predicate-object triplets to represent the entities and relationships in a domain. Each triplet is also called a fact. In a KG, nodes represent entities and edges represent relationships between entities. For example, ‘Parkinson's Disease’ and ‘Tremor’ are concrete entities of type Disease and Symptom in the medical domain. Given that ‘disease_related_symptom’ is a relationship between disease and symptom entities, ‘Parkinson's Disease disease_related_symptom Tremor’ is a triplet stating that ‘Tremor’ is a symptom related to ‘Parkinson's Disease’.
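The triplet representation described above can be sketched in a few lines. This is only an illustrative data structure using the example from the text; the class name, field names and the lookup helper are assumptions rather than part of the described system.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One subject-predicate-object fact (one edge between two entity nodes)."""
    subject: str
    predicate: str
    obj: str


def build_index(facts: List[Triplet]) -> Dict[Tuple[str, str], List[str]]:
    """Group objects by (subject, predicate) so related entities can be looked up."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for fact in facts:
        index[(fact.subject, fact.predicate)].append(fact.obj)
    return dict(index)


facts = [Triplet("Parkinson's Disease", "disease_related_symptom", "Tremor")]
index = build_index(facts)
related = index.get(("Parkinson's Disease", "disease_related_symptom"), [])
print(related)
```

A plain triplet carries no information about how strong the relationship is; every fact is simply present or absent.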

Most previous work constructed KGs from medical articles, either manually or automatically. However, manual construction requires tremendous time and effort from clinical experts. For example, it was reported that about fifteen person-years were required to build the Internist-1/QMR knowledge base [8]. Automatically constructing a KG from articles is challenging because the material is almost entirely unstructured and therefore difficult for computers to understand.

In recent years, thanks to the rapid progress of big data and natural language processing (NLP) technologies, automatically mining knowledge from electronic medical records (EMRs) has become a promising research trend [[9], [10], [11], [12], [13], [14], [15], [16],20,21]. Learning a medical KG from EMRs is less labor-intensive and more feasible than learning from articles. More importantly, the statistical properties of a KG based on real-world data make it easier to use.

Many papers introduce EMR processing algorithms, including named entity recognition (NER) [[9], [10], [11], [12], [13]], entity normalization [[14], [15], [16]], relation extraction/ranking [26] and graph embedding [20,21]. However, an efficient and systematic end-to-end procedure for building a medical KG from EMR data is still lacking. This paper aims to establish such a procedure for constructing a medical KG from large-scale EMRs. The study was performed on the big-data platform of a 3A-class hospital in China, and the constructed KG contains 9 entity types with a total of 22,508 entities. Based on the data, we propose a new quadruplet structure to represent a KG fact, in contrast to the classical triplet structure, and build a total of 579,094 quadruplets. PrTransH [21] is used to train an embedding vector for each entity and relation. To evaluate its effectiveness, the obtained KG is applied to several practical applications, including CDSS, information retrieval and knowledge transfer with neural networks. The paper closes with conclusions and prospects for future work.
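As its name suggests, PrTransH relates to translation-on-hyperplanes embedding. As background only, the sketch below scores a triplet with the standard, non-probabilistic TransH distance: head and tail are projected onto a relation-specific hyperplane, and the relation acts as a translation within it. This is not the PrTransH model of [21]; the function names and the toy vectors are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_to_hyperplane(v: Sequence[float], unit_normal: Sequence[float]) -> List[float]:
    """Remove the component of v that lies along the hyperplane's unit normal."""
    scale = dot(v, unit_normal)
    return [vi - scale * ni for vi, ni in zip(v, unit_normal)]


def transh_score(head: Sequence[float], tail: Sequence[float],
                 normal: Sequence[float], translation: Sequence[float]) -> float:
    """Squared L2 distance ||h_perp + d_r - t_perp||^2; lower means more plausible."""
    length = math.sqrt(dot(normal, normal))
    unit_normal = [n / length for n in normal]  # TransH constrains the normal to unit length
    h_perp = project_to_hyperplane(head, unit_normal)
    t_perp = project_to_hyperplane(tail, unit_normal)
    return sum((h + d - t) ** 2 for h, d, t in zip(h_perp, translation, t_perp))


# Toy 3-dimensional vectors: the tail equals head + translation once both are projected.
plausible = transh_score([1, 2, 3], [2, 2, 7], normal=[0, 0, 1], translation=[1, 0, 0])
implausible = transh_score([1, 2, 3], [5, 2, 7], normal=[0, 0, 1], translation=[1, 0, 0])
```

Projecting onto a per-relation hyperplane lets one entity take different effective positions under different relations, which helps when a relation is one-to-many, such as a disease with many related symptoms.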

Section snippets

Entity recognition and normalization

Luo L et al. [11] propose an attention-based bidirectional long short-term memory network with a conditional random field layer (Att-BiLSTM-CRF) for document-level chemical NER. The approach leverages document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. Zhang Y et al. [12] implement the BiLSTM-CRF model to simultaneously recognize five types of clinical entities in a Chinese EHR corpus. Ji B et al. [13

METHOD

In this section, we develop a systematic procedure to build the medical KG from large-scale EMRs. The procedure, as shown in Fig. 1, involves 8 main steps: 1) data preparation, 2) entity recognition, 3) entity normalization, 4) relation extraction, 5) property calculation, 6) graph cleaning, 7) related-entity ranking, and 8) graph embedding. Here we emphasize that steps 4), 5), 7) and 8) usually require much practical experience with large-scale EMRs and thus are

Entity Recognition

We compared the performance of a single BiLSTM-CRF with that of the proposed hybrid model on symptom recognition, using a dataset composed of 1000 histories of present illness and their symptoms as labeled by physicians. The recall, precision and F1-score of the single BiLSTM-CRF are 0.9368, 0.9482 and 0.9425, respectively. The hybrid model improves these to 0.9689, 0.9727 and 0.9708.
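The reported F1-scores follow from the reported precision and recall via the usual harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported for the two models.
single_f1 = f1_score(precision=0.9482, recall=0.9368)
hybrid_f1 = f1_score(precision=0.9727, recall=0.9689)
print(round(single_f1, 4), round(hybrid_f1, 4))
```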

Related-entity Ranking

A dataset is built to evaluate the performance of the proposed PSR function. For each disease, all original

APPLICATIONS

The constructed knowledge graph can be applied to many medical problems. In this paper, we demonstrate its use on three typical problems: clinical decision support, medical information retrieval and knowledge transfer with neural networks.

CONCLUSIONS AND DISCUSSIONS

This paper establishes a systematic procedure to construct a quadruplet-based medical KG from large-scale EMRs. The evaluation results show that the constructed KG is of high quality and that it can be applied successfully thanks to the real-world statistical properties of its quadruplets. The proposed ranking function PSR outperforms other algorithms under all relations, and the disease clustering results validate the efficacy of the learned embedding vector as the entity's semantic representation.

It is

FUNDING

This work is supported in part by the National Key Research Program under Grant 2018YFC0116700.

CONTRIBUTORS

LL, PW, JY and JJ were responsible for the design of the systematic procedure. LL, PW, YW, SL, JJ, SW and YL were responsible for data processing, algorithms and experiments. KG applications were developed and evaluated by LL, YW, SL and JJ.

LL wrote the first draft of the manuscript, and PW, JY, ZS, BT, THC, SW and YL made the revisions to it.

All authors, LL, PW, JY, YW, SL, JJ, ZS, BT, THC, SW and YL approved the version of the manuscript to be published, and agreed to be accountable for all

Declaration of Competing Interest

None

References (34)

  • Wang M, Liu M, Liu J, et al. Safe medicine recommendation via medical knowledge graph embedding. arXiv preprint...
  • H. Tang et al. Googling for a diagnosis--use of Google as a diagnostic aid: internet based study. BMJ (2006)
  • B. Gann. Giving patients choice and control: health informatics on the patient journey. Yearbook of Medical Informatics (2012)
  • M.A. Shwe et al. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine (1991)
  • A. Kovačević et al. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. Journal of the American Medical Informatics Association (2013)
  • B. Tang et al. Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features. BMC Medical Informatics and Decision Making (2013)
  • L. Luo et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics (2017)

1 Linfeng Li and Peng Wang contributed equally.