Improving prediction for medical institution with limited patient data: Leveraging hospital-specific data based on multicenter collaborative research network

doi:10.1016/j.artmed.2021.102024

Artificial Intelligence in Medicine

Volume 113, March 2021, 102024

https://doi.org/10.1016/j.artmed.2021.102024

Highlights

• A multisource deep transfer learning model is proposed for improving predictive performance at a single institution with limited patient data.
• The proposed approach enables better feature adaptation by incorporating hospital-specific features across source data and limited target data.
• Unlabeled target data are integrated into the model updating approach to enhance the training process for a single center with insufficient labels.
• The case study shows that the proposed model learning process achieves better discrimination and calibration than baseline models with limited EHR data.

Abstract

Background and objective

Clinical decision support assisted by prediction models usually faces the challenges of limited clinical data and a lack of labels when the model is developed with data from a single medical institution. Accordingly, research on multicenter clinical collaborative networks, which can provide external medical data, has received increasing attention. With the increasing availability of machine learning techniques such as transfer learning, leveraging large-scale patient data from multiple hospitals to build data-driven predictive models with clinical application potential provides an alternative solution to address the problem of limited patient data.

Methods

A multicenter hybrid semi-supervised transfer learning model (MHSTL) is proposed in this study. It is built on a unified common data model to ensure a standardized representation of multicenter data. The hospital-specific features, along with the co-occurrence features across domains, are then aligned through a representation learning architecture built on deep neural networks and the newly proposed neural decision forest model. In this process, limited patient data from the target hospital, both labeled and unlabeled, are incorporated during feature adaptation, thereby contributing to better model performance. Without patient-level data sharing, the proposed model learning strategy, which overcomes feature misalignment and distribution divergence, enables multi-source transfer learning when patient data at the target hospital are insufficient and unlabeled.

Results

The effectiveness of the proposed transfer learning model was evaluated on a collaborative research network of colorectal cancer patients in the US and China. The results demonstrate that the proposed model achieves much better performance than baseline models when predicting target risk with limited patient data. Better discrimination and calibration ability are also observed when sufficient labeled data are not available in the target hospital for prognosis prediction tasks. Further exploratory experiments show that the proposed approach exhibits good model generalizability regardless of the data heterogeneity. Model interpretation with SHapley Additive exPlanations shows the effectiveness of incorporating hospital-specific features in the transfer learning model.

Conclusions

The proposed method develops prediction models from multiple source hospitals and performs well by leveraging cross-domain hospital-specific feature information, thereby enhancing prediction when applied to a single medical institution with limited patient data.

Introduction

The quality of healthcare decision-making depends heavily on the validity and reliability of clinical decision support (CDS) models [1], [2], whose statistical significance in turn hinges on the quality of the medical data that support model establishment [3]. Therefore, a data-driven prediction model developed from high-quality, large-scale electronic health record (EHR) patient data can provide a more accurate basis for clinical decision-makers than other models. However, a single clinical institution is often unable to collect sufficient samples and adequate amounts of labeled data because of its limited scale. Poor-quality data leave single-center models unable to support persuasive and powerful conclusions or to improve the predictive modeling of patient outcomes [4], [5].

To overcome this challenge, an increasing amount of research has sought to expand the scope from one institution to multiple centers to mitigate the limitation caused by the lack of available training samples [6], [7], [8]. The accessibility of multi-source data has been improved by using health information exchange standards such as Fast Healthcare Interoperability Resources (FHIR) [9] or the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) [10], which solve the interoperability problem of healthcare information systems caused by the diversity of clinical decision support systems (CDSS) and EHR [11], [12]. After the patient-level data are accessed through a standardized information sharing interface by concurrently using the ‘same language’, the data from multiple sources can be used to support the clinical research in the multicenter clinical research network.

In multicenter collaborative modeling, the recently proposed transfer learning (TL) technique [13] has demonstrated its effectiveness in transferring model(s) learned from external institution(s) to a local system. Transfer learning is a machine learning method that applies the evidence from one or more related tasks in the ‘sources’ to a different but related task in the ‘target’. TL aims to reduce the distribution discrepancy across domains, thereby ensuring effective prediction on the target domain after the transfer. It is a widely used solution in various applications, such as image annotation [14] and concept extraction [15].
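To make the notion of distribution discrepancy concrete, the sketch below computes one simple gap between source and target data: the squared distance between their feature means, which corresponds to a maximum mean discrepancy with a linear kernel. This is an illustrative measure only and is not claimed to be the adaptation criterion used by the MHSTL; the function names and the toy data are assumptions.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd_squared(source: Sequence[Sequence[float]],
                       target: Sequence[Sequence[float]]) -> float:
    """Squared Euclidean distance between source and target feature means.

    Both inputs must use the same (shared) feature columns.
    """
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy data (hypothetical): two patients per domain, two shared features.
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 3.0], [4.0, 7.0]]
gap = linear_mmd_squared(source, target)
print(gap)
```

A value of zero means the two domains have identical feature means; it does not by itself mean the distributions match, which is why richer kernels are common in practice.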

However, two aspects distinguish healthcare-related problems from applications in other domains during model transfer. First, most research on transfer learning has assumed that all data are under the same feature space [16], [17], [18]; however, in medical practice, such an assumption does not always hold because the features often lie in distinct but overlapping spaces [19] (see Fig. 1). Many of the observational variables recorded in the EHR data can be hospital-specific due to factors such as different healthcare facility levels among medical institutions [20] or missing records in the electronic medical system. The discrepancy of feature space is commonly neglected in multicenter studies [21], which prevents the hospital-specific features from the source and target domains from being effectively used and thus lowers the potential predictive capabilities of the model.
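The idea of distinct but overlapping feature spaces can be sketched as a partition of column names into shared (co-occurrence) features and features specific to each hospital. The column names below are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
from typing import List, Sequence, Tuple


def partition_features(
    source_cols: Sequence[str], target_cols: Sequence[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split feature names into shared, source-only and target-only lists."""
    source_set, target_set = set(source_cols), set(target_cols)
    shared = sorted(source_set & target_set)
    source_only = sorted(source_set - target_set)
    target_only = sorted(target_set - source_set)
    return shared, source_only, target_only


# Hypothetical column names for a source registry and a target hospital.
source_cols = ["age", "sex", "tumor_stage", "registry_site_code"]
target_cols = ["age", "sex", "tumor_stage", "cea_level"]

shared, source_only, target_only = partition_features(source_cols, target_cols)
print(shared, source_only, target_only)
```

A model that assumes a single feature space would have to discard both hospital-specific lists; a hybrid approach keeps all three groups available.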

Another characteristic of model transfer for medical problems is that the available labeled data are typically limited. A large amount of unlabeled data is often found in the training data from the target hospital [22]. Collecting labeled patient data is difficult and expensive, especially when the outcome of interest is hardly observable. For example, in some fast-growing malignant cancers, symptom detection is arduous and the progression of the disease is rapid, so consistent data collection and expensive expert labeling are needed for prognosis prediction [23], [24]. This makes fully supervised model learning difficult and thus hinders the model from achieving good performance when predicting on its own cohort [25].
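One generic way to use unlabeled data is self-training with pseudo-labels, sketched below on a one-dimensional toy problem with a nearest-centroid classifier. This technique is swapped in purely for illustration and is not the model updating approach of the MHSTL; the `margin` and `rounds` parameters and the toy data are assumptions.

```python
from typing import List, Sequence, Tuple


def centroid(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def self_train(
    labeled: Sequence[Tuple[float, int]],
    unlabeled: Sequence[float],
    margin: float = 1.0,
    rounds: int = 5,
) -> Tuple[List[Tuple[float, int]], List[float]]:
    """Grow the labeled set with confident pseudo-labels.

    `labeled` holds (x, y) pairs with y in {0, 1} and must contain both classes.
    A point is pseudo-labeled only if its distances to the two class centroids
    differ by at least `margin`; ambiguous points stay unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        confident, remaining = [], []
        for x in pool:
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


# Toy data (hypothetical): two labeled points and three unlabeled ones.
labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [1.0, 9.0, 5.2]
augmented, still_unlabeled = self_train(labeled, unlabeled, margin=1.0)
print(augmented, still_unlabeled)
```

The ambiguous point near the midpoint is deliberately left unlabeled, which reflects the usual trade-off in pseudo-labeling between using more data and propagating wrong labels.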

In this study, we propose a multicenter hybrid semi-supervised transfer learning model (MHSTL) to improve prediction performance for medical institutions with limited patient data. The hospital-specific data are used through hybrid transfer learning to implement feature alignment and to solve the problems arising from feature space heterogeneity and data distribution divergence. Meanwhile, a model updating approach that integrates the unlabeled target data is used to enhance the training process and resolve the issue of poor prediction due to insufficient labels at a single center. The efficient SHapley Additive exPlanations (SHAP) approach [26] was used in this study to improve the interpretability of predictive models. The predictive model acquired from multicenter collaborative model learning provides an alternative solution to enhance the predictive power of CDS models for medical institutions with limited patient data, thereby overcoming the challenges for clinical data analytics in low-resource scenarios.
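The attribution idea behind SHAP can be illustrated with exact Shapley values computed by enumerating every feature ordering. The sketch below is a brute-force version of the underlying definition, feasible only for a handful of features, and it does not use or reproduce the SHAP library. The `toy_risk` model and the single all-zero baseline standing in for a background dataset are assumptions.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[Sequence[float]], float],
    instance: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of `predict` at `instance` relative to `baseline`.

    For each ordering, features are switched from baseline to instance values
    one at a time, and each feature is credited with the resulting change in
    the prediction. Credits are averaged over all orderings.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]
            updated = predict(current)
            totals[i] += updated - previous
            previous = updated
    return [t / len(orderings) for t in totals]


def toy_risk(x: Sequence[float]) -> float:
    """Hypothetical risk score with an interaction between features 0 and 2."""
    return 2.0 * x[0] + 1.0 * x[1] + 4.0 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at the instance and at the baseline, and the interaction term is split evenly between the two features involved.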

Section snippets

Methods

The definition of hybrid domain adaptation (DA) was first proposed by Wei et al. [27] in which the source domain and target domain share co-occurrence features but at the same time own their specific feature sets. The hybrid DA demonstrates its differences from the homogeneous DA, which adapts the models into the same feature space but with different data distributions by minimizing the discrepancies of feature distributions through techniques such as sample importance weighting or feature

Data sets

The goal of this case study is to develop a prognosis prediction model for patients with colorectal cancer (CRC) based on the collaborative multi-institution clinical research network. We are interested in the 5-year prognostic survival status after the patient has been diagnosed with CRC. The availability of a well-calibrated prognosis prediction model for malignant cancer could provide a reference for oncologists to make proactive clinical decisions. To validate our model, we used CRC data

Model validation on hybrid transfer learning

The performance of the MHSTL and two baseline models under different ratios between the training data from SAHZU and SEER is shown in Table 3. As r decreases from 0.20 to 0.05 — that is, the patient data for fully supervised model training in the target hospital become scarce — the effectiveness of the proposed MHSTL is revealed. Especially when r is less than 0.15, the MHSTL model reveals its superiority over baseline model 2, indicating that under the low-resource scenario, the target

Discussion

This study proposes an MHSTL model that aims to solve the low-resource medical problems of the target hospital by fully exploiting the knowledge provided by shared features and hospital-specific features across domains in a semi-supervised learning scheme. The effectiveness of the proposed model is validated based on the CRC data from the US and China. As shown in the Results section, the performance of the proposed model, which leverages the predictive power from a large-scale multicenter

Conclusion

The MHSTL model leverages the co-occurrence features and hospital-specific features simultaneously to provide improved prediction for medical institutions with limited patient data. The prediction model construction framework proposed in this study is able to solve the problem of the lack of labeled patient data in the target hospital which is required under the supervised learning scenario. The results of a CRC case study from the US and China demonstrated the superior performance of the

Funding

This work was supported by Major Scientific Project of Zhejiang Lab (No. 2020ND8AD01), the National Natural Science Foundation of China (No. 81771936, No. 81801796 and No. 81672916), the National Key Research and Development Program of China (No. 2018YFC0116901), and the Fundamental Research Funds for the Central Universities, China (No. 2020QNA5031).

Conflicts of interest

The authors have no conflicts of interest to declare.

Acknowledgement

We owe thanks to the staff of the National Cancer Institute (NCI) and each member involved in the Surveillance, Epidemiology and End Results

References (56)

K. Hajian-Tilaki. Sample size estimation in diagnostic test studies of biomedical informatics. J Biomed Inform (2014)
C. Mazo et al. Transfer learning for classification of cardiovascular tissues in histological images. Comput Methods Programs Biomed (2018)
L. Han et al. Semi-supervised segmentation of lesion from breast ultrasound images with attentional generative adversarial network. Comput Methods Programs Biomed (2020)
W. Sun et al. Computerized breast cancer analysis system using three stage semi-supervised learning method. Comput Methods Programs Biomed (2016)
S. Uppu et al. A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise. Int J Med Inform (2018)
Y. Guo et al. BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data. BMC Bioinform (2018)
J. Li et al. A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med (2020)
K. Van Hoorde et al. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform (2015)
X. Lv et al. Transfer learning based clinical concept extraction on data from multiple sources. J Biomed Inform (2014)
T.P.A. Debray et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol (2015)
E.H. Shortliffe et al. Clinical decision support in the era of artificial intelligence. JAMA (2018)
S.-Y. Kim. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinform (2009)
E.W. Orenstein et al. Development and dissemination of clinical decision support across institutions: standardization and sharing of refugee health screening modules. J Am Med Inform Assoc (2019)
R. Duan et al. Learning from electronic health records across multiple sites: a communication-efficient and privacy-preserving distributed algorithm. J Am Med Inform Assoc (2020)
J.C. Jakobsen et al. Thresholds for statistical and clinical significance in systematic reviews with meta-analytic methods. BMC Med Res Methodol (2014)
J.C. Jakobsen et al. Power estimations for non-primary outcomes in randomised clinical trials. BMJ Open (2019)
D. Arterburn et al. Comparative effectiveness and safety of bariatric procedures for weight loss: a PCORnet cohort study. Ann Intern Med (2018)
D. Bender et al. HL7 FHIR: an agile and RESTful approach to healthcare information exchange
G. Hripcsak et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform (2015)
Z.S.N. Reis et al. Is there evidence of cost benefits of electronic medical records, standards, or interoperability in hospital information systems? Overview of systematic reviews. JMIR Med Inform (2017)
C. Lubamba et al. Cyber-healthcare cloud computing interoperability using the HL7-CDA standard
S.J. Pan et al. A survey on transfer learning. IEEE Trans Knowl Data Eng (2009)
B. Azarkhalili et al. DeePathology: deep multi-task learning for inferring molecular pathology from cancer transcriptome. Sci Rep (2019)
H. Liu et al. Transfer learning from BERT to support insertion of new concepts into SNOMED CT
G. Lee et al. Adapting surgical models to individual hospitals using transfer learning
T. Scott et al. Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. Advances in Neural Information Processing Systems (2018)
J. Wiens et al. A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions. J Am Med Inform Assoc (2014)
M.A. Clapp et al. Patient and hospital factors associated with unexpected newborn complications among term neonates in US hospitals. JAMA Netw Open (2020)

¹: Contributed equally to this paper.
