Text mining applied to electronic cardiovascular procedure reports to identify patients with trileaflet aortic stenosis and coronary artery disease

doi:10.1016/j.jbi.2017.06.016

Journal of Biomedical Informatics

Volume 72, August 2017, Pages 77-84

https://doi.org/10.1016/j.jbi.2017.06.016

Under an Elsevier user license

open archive

Highlights

• The Oracle product Endeca was adapted for clinical text mining purposes.
• Text mining on cardiovascular procedure text was compared to ICD-9 for two phenotypes.
• Text mining exhibited higher performance metrics for both aortic stenosis and coronary artery disease.

Abstract

Background

Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest.

Methods

We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool “PennSeek”. We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes.
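As a rough illustration of applying text criteria to procedure reports and rolling report-level hits up to patients, the sketch below uses plain Python regular expressions. It is not Endeca or PennSeek, and the patterns, function names, and sample reports are hypothetical; the clinical criteria actually used in the study are not stated in this abstract.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the study's criteria.
STENOSIS = re.compile(r"\b(mild|moderate|severe)\s+aortic\s+stenosis\b", re.IGNORECASE)
NEGATED = re.compile(r"\bno\s+(evidence\s+of\s+)?aortic\s+stenosis\b", re.IGNORECASE)
BICUSPID = re.compile(r"\bbicuspid\b", re.IGNORECASE)


def report_suggests_tas(text):
    """Return True if a report mentions graded aortic stenosis, is not negated,
    and does not mention a bicuspid valve."""
    if NEGATED.search(text) or BICUSPID.search(text):
        return False
    return STENOSIS.search(text) is not None


def patients_with_tas(reports):
    """Collapse report-level hits to patient level.

    `reports` is an iterable of (patient_id, report_text) pairs; one patient
    may have several reports, as in the cohort described above.
    """
    hits = set()
    for patient_id, text in reports:
        if report_suggests_tas(text):
            hits.add(patient_id)
    return hits


# Made-up sample reports, not drawn from the study.
sample_reports = [
    ("p1", "Trileaflet aortic valve with severe aortic stenosis."),
    ("p2", "No evidence of aortic stenosis."),
    ("p3", "Bicuspid aortic valve with moderate aortic stenosis."),
    ("p1", "Normal study."),
]
flagged = patients_with_tas(sample_reports)
```

A real pipeline would need far more careful handling of negation, abbreviations, and report sections than this sketch attempts.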

Results

Text mining identified 7115 patients with TAS and 9247 patients with CAD. ICD-9 codes identified 8272 patients with TAS and 6913 patients with CAD. A total of 4346 patients with TAS and 6024 patients with CAD were identified by both approaches. For each disease, a randomly selected sample of 200–250 patients uniquely identified by text mining was compared with 200–250 patients uniquely identified by billing codes. Text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD.
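The counts above imply how many patients each method identified uniquely, which is the pool the validation samples were drawn from. The sketch below works through that arithmetic, treating the overlap of 4346 as the TAS overlap, and shows the standard PPV formula. The function names are illustrative, and the tally passed to `ppv` is a hypothetical chart-review result, not the study's data; the abstract reports only the resulting PPVs.

```python
def unique_counts(text_mining_total, icd9_total, both):
    """Return (text-mining-only, ICD-9-only) patient counts from totals and overlap."""
    return text_mining_total - both, icd9_total - both


def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed cases divided by all flagged cases."""
    return true_positives / (true_positives + false_positives)


# Totals and overlaps as reported in the Results paragraph.
tas_tm_only, tas_icd_only = unique_counts(7115, 8272, 4346)  # 2769 and 3926
cad_tm_only, cad_icd_only = unique_counts(9247, 6913, 6024)  # 3223 and 889

# Hypothetical tally: 190 confirmed and 10 unconfirmed in a reviewed sample of 200.
example_ppv = ppv(190, 10)  # 0.95
```

Because the samples were drawn only from uniquely identified patients, the reported PPVs describe the cases where the two methods disagree rather than each method's full cohort.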

Conclusion

These results highlight the superiority of text mining algorithms applied to electronic cardiovascular procedure reports in the identification of phenotypes of interest for cardiovascular research.

Keywords

Valvular heart disease

Coronary artery disease

Text mining

Administrative

Billing codes

¹: These authors have contributed equally.