Real-world data medical knowledge graph: construction and applications

doi:10.1016/j.artmed.2020.101817

Artificial Intelligence in Medicine

Volume 103, March 2020, 101817

Highlights

• This paper proposes a systematic approach to constructing a medical KG from EMRs.
• The constructed KG contains 9 entity types, with a total of 22,508 entities and 579,094 quadruplets.
• A quadruplet structure is proposed to represent a fact in the real-world KG.
• The proposed ranking function PSR achieves the best performance under all relations.
• The obtained KG is successfully applied to several practical applications.

Abstract

Objective

Medical knowledge graphs (KGs) are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.

Materials and Methods

The original data set contains 16,217,270 de-identified clinical visit records of 3,767,198 patients. The KG construction procedure comprises 8 steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge in place of the classical KG triplet. We also propose a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.

Results

A medical KG with 9 entity types, including disease and symptom, was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF-IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG@10) from 0.799 to 0.906. Embedding representations were learned for all entities and relations and were shown to be effective through disease clustering.
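For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k. It assumes graded relevance labels with linear gain and a log2 rank discount, which is one common variant; the excerpt does not say which variant or which relevance grades the authors used, and the function names here are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of related entities, listed in the order a ranking function returned them.
ranked_relevances = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked_relevances, k=10)
```

A ranking that places the most relevant entities first scores 1.0, and any other ordering of the same items scores lower.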

Conclusion

The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vectors as semantic representations of entities. Moreover, the obtained KG has been applied successfully in several settings thanks to its statistics-based quadruplets.

where $N_{co}^{min}$ is a minimum co-occurrence number and $R$ is the basic reliability value. The reliability value measures how reliable the relationship between $S_i$ and $O_{ij}$ is. The rationale for this definition is that the higher the value of $N_{co}(S_i, O_{ij})$, the more reliable the relationship. However, the reliability values of two relationships should not differ greatly if both of their co-occurrence numbers are very large. In our study, we set $N_{co}^{min} = 10$ and $R = 1$ after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.
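The defining equation for the reliability value is not reproduced in this excerpt. The sketch below uses a hypothetical logarithmic form, chosen only because it reproduces the three worked values quoted above with $N_{co}^{min} = 10$ and $R = 1$; it should not be read as the authors' formula.

```python
import math

N_CO_MIN = 10   # minimum co-occurrence number, as stated in the text
R_BASE = 1.0    # basic reliability value, as stated in the text

def reliability(n_co, n_co_min=N_CO_MIN, r_base=R_BASE):
    """Hypothetical reliability: the base value plus a log10 term that grows
    slowly once the co-occurrence count exceeds the minimum (assumed form)."""
    return r_base + math.log10(max(1, n_co - n_co_min + 1))

for n_co in (1, 100, 10000):
    print(n_co, round(reliability(n_co), 2))
```

The logarithm captures the stated intent: reliability rises with co-occurrence, but two relationships with very large counts end up with similar values.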

Introduction

Knowledge graphs (KGs) have received a lot of attention in recent years. In 2012, Google applied a KG in its search engine; since then, knowledge graphs have been used in many application fields [1]. In the medical domain, the knowledge graph is a fundamental component of artificial intelligence (AI) aided medical systems, such as clinical decision support systems (CDSSs) for diagnosis and treatment [2,3,4,5] and self-diagnosis utilities that help patients evaluate their health condition based on symptoms [6,7].

The KG is a graph-based knowledge representation and organization method that uses a set of subject-predicate-object triplets to represent the entities in a domain and the relationships between them. Each triplet is also called a fact. In a KG, nodes represent entities and edges represent relationships between entities. For example, ‘Parkinson's Disease’ and ‘Tremor’ are concrete entities of type Disease and Symptom in the medical domain. Given that ‘disease_related_symptom’ is a relationship between disease and symptom entities, ‘Parkinson's Disease disease_related_symptom Tremor’ is a triplet stating that ‘Tremor’ is a symptom related to ‘Parkinson's Disease’.
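The triplet representation described above can be sketched in a few lines. The class name, field names and index helper are illustrative assumptions; only the example fact comes from the text.

```python
from collections import defaultdict
from typing import NamedTuple

class Triplet(NamedTuple):
    """One KG fact: a subject entity, a predicate (relation), and an object entity."""
    subject: str
    predicate: str
    obj: str

facts = [
    Triplet("Parkinson's Disease", "disease_related_symptom", "Tremor"),
]

def build_index(triplets):
    """Map (subject, predicate) to the list of related objects, i.e. outgoing edges."""
    index = defaultdict(list)
    for t in triplets:
        index[(t.subject, t.predicate)].append(t.obj)
    return index

index = build_index(facts)
symptoms = index[("Parkinson's Disease", "disease_related_symptom")]
```

Indexing by subject and predicate makes "which symptoms are related to this disease" a single lookup, which is the access pattern a related-entity ranking step would need.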

Most previous work tried to construct KGs from medical articles, some manually and others automatically. However, manual construction requires tremendous time and effort from clinical experts. For example, about fifteen person-years were reportedly required to build the Internist-1/QMR knowledge base [8]. Automatic construction from articles is challenging because the material is almost entirely unstructured and therefore difficult for computers to interpret.

In recent years, thanks to the rapid progress of big data and natural language processing (NLP) technologies, automatically mining knowledge from electronic medical records (EMRs) has become a promising research trend [9,10,11,12,13,14,15,16,20,21]. Learning a medical KG from EMRs is less labor-intensive and more feasible than learning from articles. More importantly, the statistical properties of a KG based on real-world data make it easier to use.

Many papers introduce EMR processing algorithms, including named entity recognition (NER) [9,10,11,12,13], entity normalization [14,15,16], relation extraction/ranking [26] and graph embedding [20,21]. However, an efficient and systematic end-to-end procedure for building a medical KG from EMR data is still lacking. This paper aims to establish such a procedure for large-scale EMRs. The study is performed on the big-data platform of a 3A-class hospital in China, and the constructed KG contains 9 entity types with a total of 22,508 entities. Based on the data, we propose a new quadruplet structure to represent a KG fact, in contrast to the classical triplet structure, and build a total of 579,094 quadruplets. PrTransH [21] is used to train an embedding vector for each entity and relation. To evaluate its effectiveness, the obtained KG is applied to several practical applications, including CDSS, information retrieval and knowledge transfer with neural networks. The paper closes with conclusions and prospects for future work.

Entity recognition and normalization

Luo L et al. [11] propose an attention-based bidirectional long short-term memory network with a conditional random field layer (Att-BiLSTM-CRF) for document-level chemical NER. The approach leverages document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. Zhang Y et al. [12] implement the BiLSTM-CRF model to simultaneously recognize five types of clinical entities in a Chinese EHR corpus. Ji B et al. [13

METHOD

In this section, we develop a systematic procedure to build the medical KG from large-scale EMRs. The procedure, as shown in Fig. 1, involves 8 main steps, which are 1) data preparation, 2) entity recognition, 3) entity normalization, 4) relation extraction, 5) property calculation, 6) graph cleaning, 7) related-entity ranking, and 8) graph embedding, respectively. Here we emphasize that the steps 4), 5), 7) and 8) usually require much practical experience on large-scale EMRs and thus are

Entity Recognition

We compared the performance of a single BiLSTM-CRF model and the proposed hybrid model on symptom recognition, using a dataset of 1,000 histories of present illness with symptoms labeled by physicians. The recall, precision and F1-score of the single BiLSTM-CRF are 0.9368, 0.9482 and 0.9425, respectively. With the hybrid model, they improve to 0.9689, 0.9727 and 0.9708, respectively.
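As a quick consistency check, the reported F1-scores can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported F1) as given in the text
reported = {
    "BiLSTM-CRF": (0.9368, 0.9482, 0.9425),
    "hybrid": (0.9689, 0.9727, 0.9708),
}

for name, (recall, precision, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

Both recomputed values agree with the reported figures to four decimal places.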

Related-entity Ranking

A data-set is built to evaluate the performance of the proposed PSR function. For each disease, all original

APPLICATIONS

The constructed knowledge graph can be applied to many medical problems. In this paper, we demonstrate its use on three typical problems: clinical decision support, medical information retrieval, and knowledge transfer with neural networks.

CONCLUSIONS AND DISCUSSIONS

This paper establishes a systematic procedure to construct a quadruplet-based medical KG from large-scale EMRs. The evaluation results show that the constructed KG is of high quality, and that it can be applied successfully owing to the real-world statistical properties of the quadruplets. The proposed ranking function PSR outperforms other algorithms under all relations, and the disease clustering results validate the efficacy of the learned embedding vectors as semantic representations of entities.

It is

FUNDING

This work is supported in part by the National Key Research Program under Grant 2018YFC0116700.

CONTRIBUTORS

LL, PW, JY and JJ were responsible for the design of the systematic procedure. LL, PW, YW, SL, JJ, SW and YL were responsible for data processing, algorithms and experiments. KG applications are developed and evaluated by LL, YW, SL and JJ.

LL wrote the first draft of the manuscript, and PW, JY, ZS, BT, THC, SW and YL made the revisions to it.

All authors, LL, PW, JY, YW, SL, JJ, ZS, BT, THC, SW and YL approved the version of the manuscript to be published, and agreed to be accountable for all

Declaration of Competing Interest

None

References (34)

G. Salton et al. Term-weighting approaches in automatic text retrieval. Information Processing & Management (1988)
O. Vechtomova et al. A domain-independent approach to finding related entities. Information Processing & Management (2012)
C. Kang et al. Learning to rank related entities in Web search. Neurocomputing (2015)
C. Zhao et al. A study of EMR-based medical knowledge network and its applications. Computer Methods and Programs in Biomedicine (2017)
C. Zhao et al. EMR-based medical knowledge representation and inference via Markov random fields and distributed representation learning. Artificial Intelligence in Medicine (2018)
Y. Shen et al. CBN: Constructing a clinical Bayesian network based on data from the electronic medical record. Journal of Biomedical Informatics (2018)
Z. Zhao et al. Architecture of knowledge graph construction techniques. International Journal of Pure and Applied Mathematics (2018)
G.O. Barnett et al. DXplain. An evolving diagnostic decision-support system. JAMA (1987)
L.J. Bisson et al. Accuracy of a computer-based diagnostic program for ambulatory patients with knee pain. The American Journal of Sports Medicine (2014)
R.A. Miller. Medical diagnostic decision support systems—past, present, and future: a threaded bibliography and brief commentary. Journal of the American Medical Informatics Association (1994)
Wang M, Liu M, Liu J, et al. Safe medicine recommendation via medical knowledge graph embedding. arXiv preprint...
H. Tang et al. Googling for a diagnosis--use of Google as a diagnostic aid: internet based study. BMJ (2006)
B. Gann. Giving patients choice and control: health informatics on the patient journey. Yearbook of Medical Informatics (2012)
M.A. Shwe et al. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine (1991)
A. Kovačević et al. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. Journal of the American Medical Informatics Association (2013)
B. Tang et al. Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features. BMC Medical Informatics and Decision Making (2013)
L. Luo et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics (2017)

¹: Linfeng Li and Peng Wang contributed equally.
