Improving condition severity classification with an efficient active learning based framework

https://doi.org/10.1016/j.jbi.2016.03.016
Under an Elsevier user license
open archive

Highlights

  • The challenge of efficient condition severity classification is formalized as an active learning task.

  • We present and compare several active learning (AL) strategies within CAESAR-ALE.

  • With our AL methods we reduced medical experts’ labeling efforts by 48–64%.

  • CAESAR-ALE improves the predictive performance of severe condition classification.

  • CAESAR-ALE can learn from labeled data obtained using labelers without clinical training.

Abstract

Classification of condition severity can be useful for discriminating among sets of conditions or phenotypes, for example when prioritizing patient care or for other healthcare purposes. Electronic Health Records (EHRs) represent a rich source of labeled information that can be harnessed for severity classification. However, labeling EHRs is expensive and in many cases requires professionals with a high level of expertise. In this study, we demonstrate the use of Active Learning (AL) techniques to decrease expert labeling efforts. We employ three AL methods and demonstrate their ability to reduce labeling efforts while effectively discriminating condition severity. We incorporate these three AL methods into a new framework, based on the original CAESAR (Classification Approach for Extracting Severity Automatically from Electronic Health Records) framework, to create the Active Learning Enhancement framework (CAESAR-ALE). We applied CAESAR-ALE to a dataset containing 516 conditions of varying severity levels that were manually labeled by seven experts. Our dataset, called the “CAESAR dataset,” was created from the medical records of 1.9 million patients treated at Columbia University Medical Center (CUMC). All three AL methods decreased labelers’ efforts compared to the learning methods applied by the original CAESAR framework, in which the classifier was trained on the entire set of conditions. Depending on the AL strategy used in the current study, the reduction ranged from 48% to 64%, which can result in significant savings in both time and money. As for the PPV (precision) measure, CAESAR-ALE achieved more than 13% absolute improvement in the predictive capabilities of the framework when classifying conditions as severe. These results demonstrate the potential of AL methods to decrease the labeling efforts of medical experts, while increasing accuracy given the same (or even a smaller) number of acquired conditions.
We also demonstrated that the methods included in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of human labelers with different levels of professional expertise.
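
For readers less familiar with the PPV measure cited above, its standard definition is given here, together with one reading of "absolute improvement". The symbols TP and FP (true and false positives for the "severe" class) and the subscripts are notation introduced for illustration, and interpreting "absolute" as a difference in percentage points, rather than a relative change, is an assumption:

```latex
\mathrm{PPV} = \frac{TP}{TP + FP},
\qquad
\Delta_{\mathrm{abs}} = \mathrm{PPV}_{\text{CAESAR-ALE}} - \mathrm{PPV}_{\text{CAESAR}}
```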

Graphical abstract

(1) Inducing the initial classification model from a small initial training set of conditions randomly selected from the CAESAR dataset. (2) Evaluating the classification model using the test set (containing new unseen conditions) to measure its initial performance. (3) Introducing the pool of unlabeled conditions to the sampling methods. During each trial, a defined number of the most informative conditions are selected according to the AL method’s preferences (or randomly selected by the baseline Random method), and their labels are revealed by the single gold standard labeler used in the original CAESAR system (in a real system the selected conditions will be labeled by an expert, but in our dataset all of the conditions are already labeled). (4) Adding the acquired conditions to the training set and removing them from the pool. (5) Inducing an updated classification model using the updated training set. (6) This process (stages 2–6) iterates until the entire pool is acquired.
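
The stages above can be sketched as a pool-based acquisition loop. This is only an illustration under stated assumptions, not the CAESAR-ALE implementation: the classifier is a simple class-centroid linear model standing in for the SVM, the data are synthetic, accuracy is used as the evaluation measure, only the Random baseline selector is shown, and every function name is hypothetical.

```python
import random

def fit_centroid(X, y):
    """Stand-in linear model (NOT an SVM): hyperplane midway between class means.
    Assumes the training set contains at least one example of each class."""
    dim = len(X[0])
    means = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    w = [means[1][j] - means[0][j] for j in range(dim)]
    b = -sum(w[j] * (means[1][j] + means[0][j]) / 2 for j in range(dim))
    return w, b

def score(model, x):
    """Signed score; positive means predicted severe (label 1)."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def accuracy(model, X, y):
    hits = sum((score(model, x) > 0) == (lab == 1) for x, lab in zip(X, y))
    return hits / len(y)

def random_select(model, pool_X, k, rng):
    """Baseline Random method: ignore the model, pick k pool indices at random."""
    return rng.sample(range(len(pool_X)), min(k, len(pool_X)))

def active_learning_loop(train, pool, test, select, k, rng):
    train_X, train_y = list(train[0]), list(train[1])
    pool_X, pool_y = list(pool[0]), list(pool[1])
    history = []
    while True:
        model = fit_centroid(train_X, train_y)            # stages 1 and 5
        history.append((len(train_y), accuracy(model, test[0], test[1])))  # stage 2
        if not pool_X:                                    # stage 6: pool exhausted
            break
        picked = set(select(model, pool_X, k, rng))       # stage 3
        for i in sorted(picked, reverse=True):            # stage 4
            train_X.append(pool_X.pop(i))
            train_y.append(pool_y.pop(i))                 # label is "revealed" here
    return history

def make_toy_data(n, rng):
    """Synthetic 2-D points; label 1 plays the role of 'severe'."""
    X, y = [], []
    for i in range(n):
        lab = i % 2
        centre = 1.0 if lab else -1.0
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(lab)
    return X, y

rng = random.Random(0)
X, y = make_toy_data(80, rng)
train, pool, test = (X[:6], y[:6]), (X[6:50], y[6:50]), (X[50:], y[50:])
history = active_learning_loop(train, pool, test, random_select, k=5, rng=rng)
for n_labeled, acc in history:
    print(n_labeled, round(acc, 3))
```

Passing a different `select` function swaps the acquisition strategy without touching the loop, which mirrors how the AL methods and the Random baseline are compared on the same pool.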

Keywords

Active learning
Electronic Health Records
Phenotyping
Condition
Severity

Abbreviations

CAESAR: Classification Approach for Extracting Severity Automatically from Electronic Health Records
CAESAR-ALE: Classification Approach for Extracting Severity Automatically from Electronic Health Records – Active Learning Enhancement
EHR: Electronic Health Record
AL: Active Learning
SVM: Support Vector Machines
VS: Version Space
SNOMED-CT: Systematized Nomenclature of Medicine – Clinical Terms
ICD-9: International Classification of Diseases – Version 9
SVM-Margin: Support Vector Machines-Margin Method – an existing AL method oriented towards acquiring informative conditions that lie closest to the separating hyperplane (inside the margin).
Exploitation: An AL method included in the CAESAR-ALE framework that is oriented towards acquisition of severe conditions.
Combination_XA: An AL method included in the CAESAR-ALE framework that combines elements of the Exploitation method and the SVM-Margin method, so that it applies a hybrid acquisition strategy for enhanced improvement of the CAESAR method.
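
The three acquisition strategies defined above can be contrasted with a small sketch that ranks pool items by their signed distance from the separating hyperplane (positive meaning the severe side). Only the SVM-Margin rule follows directly from its definition here. Treating Exploitation as "pick the items deepest on the severe side" and Combination_XA as "SVM-Margin in early trials, Exploitation afterwards" are assumptions made for illustration, as are the function names and the `switch_trial` parameter.

```python
def svm_margin_select(distances, k):
    """Pick the k pool items closest to the hyperplane (smallest |distance|)."""
    order = sorted(range(len(distances)), key=lambda i: abs(distances[i]))
    return order[:k]

def exploitation_select(distances, k):
    """Assumed rule: pick the k items deepest on the severe (positive) side."""
    order = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    return order[:k]

def combination_select(distances, k, trial, switch_trial):
    """Assumed hybrid: margin-based before `switch_trial`, exploitation after.
    `switch_trial` is a hypothetical parameter, not taken from the source."""
    if trial < switch_trial:
        return svm_margin_select(distances, k)
    return exploitation_select(distances, k)

# Hypothetical signed distances for five pool conditions.
distances = [0.1, -2.0, 1.5, -0.05, 0.7]

margin_pick = svm_margin_select(distances, 2)                               # [3, 0]
exploit_pick = exploitation_select(distances, 2)                            # [2, 4]
early_pick = combination_select(distances, 2, trial=0, switch_trial=1)      # [3, 0]
late_pick = combination_select(distances, 2, trial=3, switch_trial=1)       # [2, 4]
print(margin_pick, exploit_pick, early_pick, late_pick)
```

In this toy example the margin rule favors the two uncertain items near zero, while the exploitation rule favors the two items the model most confidently places on the severe side.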
