Building machine learning models without sharing patient data: A simulation-based analysis of distributed learning by ensembling

https://doi.org/10.1016/j.jbi.2020.103424
Under an Elsevier user license
open archive

Highlights

  • The difficulty of sharing patient data hinders training machine learning models.

  • Distributed learning trains models locally, sidestepping the need to share data.

  • Ensembling is a simple way to perform distributed learning.

  • ANNs, SVMs, and RF can be used for distributed learning by ensembling.

  • Small local datasets, as is the case with rare diseases, can be used.

Abstract

The development of machine learning solutions in medicine is often hindered by difficulties associated with sharing patient data. Distributed learning aims to train machine learning models locally without requiring data sharing. However, the utility of distributed learning for rare diseases, with only a few training examples at each contributing local center, has not been investigated. The aim of this work was to simulate distributed learning models by ensembling with artificial neural networks (ANN), support vector machines (SVM), and random forests (RF) and evaluate them using four medical datasets. Distributed learning by ensembling locally trained agents improved performance compared to models trained using the data from a single institution, even in cases where only very few training examples were available per local center. The performance of distributed learning improved as more locally trained models were added to the ensemble. Local class imbalance reduced distributed SVM performance but did not impact distributed RF and ANN classification. Our results suggest that distributed learning by ensembling can be used to train machine learning models without sharing patient data and is suitable for use with small datasets.
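The ensembling idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' setup: the data are synthetic two-dimensional Gaussian blobs rather than any of the four medical datasets, a simple nearest-centroid classifier stands in for the ANN, SVM, and RF agents so that the sketch needs only the Python standard library, and the function names (`make_data`, `train_local`, `predict_ensemble`, `simulate`) are hypothetical. Each simulated center trains on its own few examples, and the ensemble combines the local models by majority vote, so no training examples are exchanged between centers.

```python
import random
from collections import Counter


def make_data(n, rng):
    """Draw n labeled points from two 2-D Gaussian blobs (synthetic, illustrative only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def train_local(samples):
    """Fit a nearest-centroid model on one center's data; returns {label: centroid}."""
    sums, counts = {}, Counter()
    for (x0, x1), y in samples:
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += x0
        s[1] += x1
        counts[y] += 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}


def predict_local(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: (x[0] - model[y][0]) ** 2 + (x[1] - model[y][1]) ** 2)


def predict_ensemble(models, x):
    """Majority vote over the locally trained models (assumed combination rule)."""
    votes = Counter(predict_local(m, x) for m in models)
    return votes.most_common(1)[0][0]


def accuracy(predict, test_set):
    """Fraction of test points for which predict(x) matches the true label."""
    return sum(predict(x) == y for x, y in test_set) / len(test_set)


def simulate(n_centers=15, n_per_center=4, n_test=500, seed=0):
    """Return (mean single-center accuracy, ensemble accuracy) on a shared test set."""
    rng = random.Random(seed)
    # Each center keeps its own small training set; only fitted models are pooled.
    local_sets = [make_data(n_per_center, rng) for _ in range(n_centers)]
    test_set = make_data(n_test, rng)
    models = [train_local(s) for s in local_sets]
    local_accs = [accuracy(lambda x, m=m: predict_local(m, x), test_set) for m in models]
    ensemble_acc = accuracy(lambda x: predict_ensemble(models, x), test_set)
    return sum(local_accs) / len(local_accs), ensemble_acc


mean_local_acc, ensemble_acc = simulate()
print(f"mean single-center accuracy: {mean_local_acc:.3f}")
print(f"ensemble accuracy:           {ensemble_acc:.3f}")
```

Varying `n_centers` and `n_per_center` in `simulate` gives a rough feel for the two factors the abstract discusses, namely the number of locally trained models in the ensemble and the number of training examples per center. The numbers printed by this toy setup say nothing about the results reported in the paper.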

Keywords

Artificial neural networks
Distributed learning
Machine learning
Medical information systems
Random forest
Support vector machines

Abbreviations

ADNI: Alzheimer’s Disease Neuroimaging Initiative
ANN: artificial neural network
ANOVA: analysis of variance
RF: random forest
SVM: support vector machine
SD: standard deviation

Data availability

Data used in preparation of this article was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
