Journal of Biomedical Informatics

Volume 73, September 2017, Pages 76-83
A cascaded approach for Chinese clinical text de-identification with less annotation effort

https://doi.org/10.1016/j.jbi.2017.07.017
Under an Elsevier user license
open archive

Highlights

  • Pattern matching can accurately locate sentences that contain PHI.

  • Constructing a dense PHI corpus reduces manual annotation cost.

  • A cascaded method can enhance de-identification performance, especially sensitivity.

Abstract

With the rapid adoption of Electronic Health Records (EHR) in China, an increasing amount of clinical data has become available to support clinical research. Secondary use of clinical data usually requires de-identification of personal information to protect patient privacy. Since manual de-identification of free clinical text requires a significant amount of human work, developing an automated de-identification system is necessary. While many de-identification systems are available for English clinical text, designing one for Chinese clinical text faces many challenges, such as the unavailability of necessary lexical resources and the sparsity of patient health information (PHI) in Chinese clinical text. In this paper, we design a de-identification pipeline that takes advantage of both rule-based and machine learning techniques. In particular, our method can effectively construct a data set with dense PHI, which significantly reduces annotation time for subsequent supervised learning. We experiment on a dataset of 3000 heterogeneous clinical documents to evaluate the annotation cost and the de-identification performance. Our approach increases the efficiency of the annotation effort by over 60% while reaching an F score above 90%. We demonstrate that combining rule-based and machine learning techniques is an effective way to reduce annotation cost and achieve high performance in the Chinese clinical text de-identification task.
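
The first stage of such a cascade, using pattern matching to keep only sentences likely to contain PHI so that annotators see a dense corpus, might look roughly like the sketch below. The abstract does not list the actual rules, so the trigger patterns, the sentence splitter, and the function names (`split_sentences`, `has_phi_candidate`, `build_dense_corpus`) are all hypothetical illustrations rather than the authors' implementation.

```python
import re

# Hypothetical trigger patterns, chosen only for illustration.
# The paper's actual rules are not given in the abstract.
PHI_PATTERNS = [
    re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),       # dates such as 2016年3月5日
    re.compile(r"(?<!\d)1\d{10}(?!\d)"),            # 11-digit mobile numbers
    re.compile(r"(姓名|住址|电话|医师)[:：]"),        # field labels that often precede PHI
]


def split_sentences(text):
    """Split on common Chinese sentence terminators and newlines."""
    parts = re.split(r"[。！？\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def has_phi_candidate(sentence):
    """Return True if any trigger pattern matches the sentence."""
    return any(p.search(sentence) for p in PHI_PATTERNS)


def build_dense_corpus(documents):
    """Keep only the sentences that match at least one trigger pattern."""
    return [
        sentence
        for doc in documents
        for sentence in split_sentences(doc)
        if has_phi_candidate(sentence)
    ]


# Toy documents with made-up content (no real patient data).
docs = [
    "患者于2016年3月5日入院。主诉头痛三天。",
    "姓名：张三。血压正常。联系电话：13800000000。",
]
dense = build_dense_corpus(docs)
for sentence in dense:
    print(sentence)
```

In this toy run, three of the five sentences survive the filter. Only the retained sentences would then be annotated by hand and used to train a supervised model, which is where the reduction in annotation effort would come from.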

Keywords

De-identification
Clinical natural language processing
Chinese NLP
Annotation cost
