Improving prediction for medical institution with limited patient data: Leveraging hospital-specific data based on multicenter collaborative research network

https://doi.org/10.1016/j.artmed.2021.102024

Highlights

  • A multisource deep transfer learning model is proposed for improving predictive performance on a single institution with limited patient data.

  • The proposed approach enables better feature adaptation by incorporating hospital-specific features across source data and limited target data.

  • Unlabeled target data are integrated into the model updating approach to enhance training for a single center with insufficient labels.

  • The case study shows that the proposed model learning process achieves better discrimination and calibration than baseline models when EHR data are limited.

Abstract

Background and objective

Clinical decision support assisted by prediction models usually faces the challenges of limited clinical data and a lack of labels when the model is developed with data from a single medical institution. Accordingly, research on multicenter clinical collaborative networks, which can provide external medical data, has received increasing attention. With the increasing availability of machine learning techniques such as transfer learning, leveraging large-scale patient data from multiple hospitals to build data-driven predictive models with clinical application potential provides an alternative solution to address the problem of limited patient data.

Methods

A multicenter hybrid semi-supervised transfer learning model (MHSTL) is proposed in this study. It is built on a unified common data model to ensure a standardized representation of multicenter data. The hospital-specific features, along with the co-occurrence features across domains, are then aligned through a representation learning architecture built on deep neural networks and the recently proposed neural decision forest model. In this process, limited patient data from the target hospital, both labeled and unlabeled, are incorporated during feature adaptation, thereby contributing to better model performance. Without patient-level data sharing, the proposed model learning strategy overcomes feature misalignment and distribution divergence and thus enables multi-source transfer learning when patient data at the target hospital are insufficient and unlabeled.

Results

The effectiveness of the proposed transfer learning model was evaluated on a collaborative research network of colorectal cancer patients in the US and China. The results demonstrate that the proposed model achieves much better performance than baseline models for predicting target risk when patient data are limited. Better discrimination and calibration are also observed for prognosis prediction tasks when sufficient labeled data are not available in the target hospital. Further exploratory experiments show that the proposed approach generalizes well regardless of data heterogeneity. SHapley Additive exPlanations, used for model interpretation, show the effectiveness of incorporating hospital-specific features in the transfer learning model.

Conclusions

The proposed method develops prediction models from multiple source hospitals and performs well by leveraging cross-domain hospital-specific feature information, thereby enhancing prediction when applied to a single medical institution with limited patient data.

Introduction

The quality of healthcare decision-making depends heavily on the validity and reliability of clinical decision support (CDS) models [1], [2], whose statistical significance in turn hinges on the quality of the medical data that support model development [3]. Therefore, a data-driven prediction model developed from high-quality, large-scale electronic health record (EHR) patient data can provide a more accurate basis for clinical decision-makers than other models. However, a single clinical institution is often unable to collect a sufficient sample size or an adequate amount of labeled data because of its limited scale. Poor-quality data leave single-center models unable to support persuasive and powerful conclusions or to improve predictive modeling of patient outcomes [4], [5].

To overcome this challenge, an increasing amount of research has sought to expand the scope from one institution to multiple centers to mitigate the limitation caused by the lack of available training samples [6], [7], [8]. The accessibility of multi-source data has been improved by using health information exchange standards such as Fast Healthcare Interoperability Resources (FHIR) [9] or the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) [10], which solve the interoperability problem of healthcare information systems caused by the diversity of clinical decision support systems (CDSS) and EHR [11], [12]. After the patient-level data are accessed through a standardized information sharing interface by concurrently using the ‘same language’, the data from multiple sources can be used to support the clinical research in the multicenter clinical research network.

In multicenter collaborative modeling, the recently proposed transfer learning (TL) technique [13] has demonstrated its effectiveness in transferring model(s) learned from external institution(s) to a local system. Transfer learning is a machine learning method that applies the evidence from the ‘sources’ in one or more related tasks to a different but related task in the ‘target’. TL aims to reduce the distribution discrepancy across domains, thereby ensuring effective prediction on the target domain after the transfer. It is widely used in applications such as image annotation [14] and concept extraction [15].
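As a rough illustration of what "distribution discrepancy across domains" can mean in practice, the sketch below computes a simple discrepancy score between a source cohort and a target cohort: the squared distance between their feature means, which equals the biased maximum mean discrepancy under a linear kernel. This is a generic measure chosen for illustration, not the measure used in this study, and the function names and toy values are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd2(source_rows, target_rows):
    """Squared distance between domain means.

    This equals the biased MMD^2 estimate with a linear kernel; it is an
    illustrative discrepancy score, not the paper's adaptation objective.
    """
    mu_s = mean_vector(source_rows)
    mu_t = mean_vector(target_rows)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy cohorts with two shared features each (made-up values).
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 3.0], [4.0, 7.0]]
discrepancy = linear_mmd2(source, target)
print(discrepancy)
```

A score of zero means the two cohorts have identical feature means; a domain adaptation method would typically try to learn a representation in which a discrepancy of this kind is small.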

However, two aspects distinguish healthcare-related problems from applications in other domains during model transfer. First, most research on transfer learning has assumed that all data share the same feature space [16], [17], [18]; in medical practice, this assumption does not always hold because the features often lie in distinct but overlapping spaces [19] (see Fig. 1). Many observational variables can be hospital-specific because of factors such as differing healthcare facility levels among medical institutions [20] or missing records in the electronic medical system, all of which are reflected in the EHR data. The discrepancy in feature space is commonly neglected in multicenter studies [21], which prevents the hospital-specific features from the source and target domains from being used effectively and thus lowers the potential predictive capability of the model.
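The idea of distinct but overlapping feature spaces can be made concrete with a small sketch that partitions the variables of two hospitals into shared (co-occurrence) features and hospital-specific features. The feature names below are hypothetical and serve only as an illustration; they are not taken from the study's data sets.

```python
def split_feature_space(source_features, target_features):
    """Partition two hospitals' feature names into shared and specific sets."""
    source, target = set(source_features), set(target_features)
    return {
        "shared": sorted(source & target),           # co-occurrence features
        "source_specific": sorted(source - target),  # only in the source hospital
        "target_specific": sorted(target - source),  # only in the target hospital
    }


# Hypothetical column names, for illustration only.
source_cols = ["age", "sex", "tumor_stage", "registry_site_code"]
target_cols = ["age", "sex", "tumor_stage", "cea_level"]
parts = split_feature_space(source_cols, target_cols)
print(parts)
```

A model that assumes a single common feature space would train only on the `shared` columns and discard the other two groups, which is the loss of information described above.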

Another characteristic of model transfer for medical problems is that the available labeled data are typically limited. A large amount of unlabeled data is often found in the training data from the target hospital [22]. Collecting labeled patient data is difficult and expensive, especially when the outcome of interest is hard to observe. For example, in some fast-growing malignant cancers, symptom detection is arduous and the disease progresses rapidly, so prognosis prediction requires consistent data collection and expensive expert labeling [23], [24]. This makes fully supervised model learning difficult and thus hinders the model from achieving good performance when predicting on its own cohort [25].
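One generic way to make use of unlabeled records is self-training with pseudo-labels: a model fitted on the few labeled records assigns labels to the unlabeled records it is most confident about, and is then refitted on the enlarged set. The sketch below illustrates this with a nearest-centroid classifier in plain Python. It is an assumption-laden illustration of semi-supervised learning in general, not the model updating approach of this study; the function names, the `min_margin` threshold, and the toy data are all hypothetical.

```python
def fit_centroids(X, y):
    """Return the mean feature vector of each class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_with_margin(centroids, row):
    """Nearest class plus the distance gap to the second-nearest class."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, c)) ** 0.5, label)
        for label, c in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=5):
    """Iteratively pseudo-label confident unlabeled rows and refit."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        confident, rest = [], []
        for row in pool:
            label, margin = predict_with_margin(centroids, row)
            (confident if margin >= min_margin else rest).append((row, label))
        if not confident:
            break  # nothing new to learn from
        for row, label in confident:
            X_lab.append(row)
            y_lab.append(label)
        pool = [row for row, _ in rest]
    return fit_centroids(X_lab, y_lab), pool


# Toy data: two labeled records and three unlabeled ones (made-up values).
X_lab, y_lab = [[0.0, 0.0], [4.0, 4.0]], [0, 1]
X_unlab = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]]
centroids, still_unlabeled = self_train(X_lab, y_lab, X_unlab)
print(centroids, still_unlabeled)
```

In this toy run the two unlabeled records close to a class centroid are absorbed into the training set, while the ambiguous record midway between the classes stays unlabeled because its margin is below the threshold.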

In this study, we propose a multicenter hybrid semi-supervised transfer learning model (MHSTL) to improve prediction performance for medical institutions with limited patient data. The hospital-specific data are used through hybrid transfer learning to implement feature alignment and to address the problems arising from feature space heterogeneity and data distribution divergence. Meanwhile, a model updating approach that integrates the unlabeled target data is used to enhance the training process and resolve the issue of poor prediction caused by insufficient labels at a single center. The SHapley Additive exPlanations (SHAP) approach [26] is used to improve the interpretability of the predictive models. The predictive model acquired from multicenter collaborative model learning provides an alternative way to enhance the predictive power of CDS models for medical institutions with limited patient data, thereby overcoming the challenges of clinical data analytics in low-resource scenarios.

Section snippets

Methods

The definition of hybrid domain adaptation (DA) was first proposed by Wei et al. [27] in which the source domain and target domain share co-occurrence features but at the same time own their specific feature sets. The hybrid DA demonstrates its differences from the homogeneous DA, which adapts the models into the same feature space but with different data distributions by minimizing the discrepancies of feature distributions through techniques such as sample importance weighting or feature

Data sets

The goal of this case study is to develop a prognosis prediction model for patients with colorectal cancer (CRC) based on the collaborative multi-institution clinical research network. We are interested in the 5-year prognostic survival status after the patient has been diagnosed with CRC. The availability of a well-calibrated prognosis prediction model for malignant cancer could provide a reference for oncologists to make proactive clinical decisions. To validate our model, we used CRC data

Model validation on hybrid transfer learning

The performance of the MHSTL and two baseline models under different ratios between the training data from SAHZU and SEER is shown in Table 3. As r decreases from 0.20 to 0.05 — that is, the patient data for fully supervised model training in the target hospital become scarce — the effectiveness of the proposed MHSTL is revealed. Especially when r is less than 0.15, the MHSTL model reveals its superiority over baseline model 2, indicating that under the low-resource scenario, the target

Discussion

This study proposes an MHSTL model that aims to solve the low-resource medical problems of the target hospital by fully exploiting the knowledge provided by shared features and hospital-specific features across domains in a semi-supervised learning scheme. The effectiveness of the proposed model is validated based on the CRC data from the US and China. As shown in the Results section, the performance of the proposed model, which leverages the predictive power from a large-scale multicenter

Conclusion

The MHSTL model leverages the co-occurrence features and hospital-specific features simultaneously to provide improved prediction for medical institutions with limited patient data. The prediction model construction framework proposed in this study is able to solve the problem of the lack of labeled patient data in the target hospital which is required under the supervised learning scenario. The results of a CRC case study from the US and China demonstrated the superior performance of the

Funding

This work was supported by Major Scientific Project of Zhejiang Lab (No. 2020ND8AD01), the National Natural Science Foundation of China (No. 81771936, No. 81801796 and No. 81672916), the National Key Research and Development Program of China (No. 2018YFC0116901), and the Fundamental Research Funds for the Central Universities, China (No. 2020QNA5031).

Conflicts of interest

The authors have no conflicts of interest to declare.

Acknowledgement


We owe thanks to the staff of the National Cancer Institute (NCI) and each member involved in the Surveillance, Epidemiology and End Results

References (56)

  • E.H. Shortliffe et al.

    Clinical decision support in the era of artificial intelligence

    JAMA

    (2018)
  • S.-Y. Kim

    Effects of sample size on robustness and prediction accuracy of a prognostic gene signature

    BMC Bioinform

    (2009)
  • E.W. Orenstein et al.

    Development and dissemination of clinical decision support across institutions: standardization and sharing of refugee health screening modules

    J Am Med Inform Assoc

    (2019)
  • R. Duan et al.

    Learning from electronic health records across multiple sites: a communication-efficient and privacy-preserving distributed algorithm

    J Am Med Inform Assoc

    (2020)
  • J.C. Jakobsen et al.

    Thresholds for statistical and clinical significance in systematic reviews with meta-analytic methods

    BMC Med Res Methodol

    (2014)
  • J.C. Jakobsen et al.

    Power estimations for non-primary outcomes in randomised clinical trials

    BMJ Open

    (2019)
  • D. Arterburn et al.

    Comparative effectiveness and safety of bariatric procedures for weight loss: a PCORnet cohort study

    Ann Intern Med

    (2018)
  • D. Bender et al.

    HL7 FHIR: an agile and RESTful approach to healthcare information exchange

  • G. Hripcsak et al.

    Observational health data sciences and informatics (OHDSI): opportunities for observational researchers

    Stud Health Technol Inform

    (2015)
  • Z.S.N. Reis et al.

    Is there evidence of cost benefits of electronic medical records, standards, or interoperability in hospital information systems? Overview of systematic reviews

    JMIR Med Inform

    (2017)
  • C. Lubamba et al.

    Cyber-healthcare cloud computing interoperability using the HL7-CDA standard

  • S.J. Pan et al.

    A survey on transfer learning

    IEEE Trans Knowl Data Eng

    (2009)
  • B. Azarkhalili et al.

    DeePathology: deep multi-task learning for inferring molecular pathology from cancer transcriptome

    Sci Rep

    (2019)
  • H. Liu et al.

    Transfer learning from BERT to support insertion of new concepts into SNOMED CT

  • G. Lee et al.

    Adapting surgical models to individual hospitals using transfer learning

  • T. Scott et al.

    Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning

    Advances in neural information processing systems

    (2018)
  • J. Wiens et al.

    A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions

    J Am Med Inform Assoc

    (2014)
  • M.A. Clapp et al.

    Patient and hospital factors associated with unexpected newborn complications among term neonates in US hospitals

    JAMA Netw Open

    (2020)
    1

    Contributed equally to this paper.
