Statistical semantic and clinician confidence analysis for correcting abbreviations and spelling errors in clinical progress notes

https://doi.org/10.1016/j.artmed.2011.08.003

Abstract

Motivation

Progress notes are narrative summaries about the status of patients during the course of treatment or care. Time and efficiency pressures have ensured clinicians' continued preference for unstructured text over entering data in forms when composing progress notes. The ability to extract meaningful data from the unstructured text contained within the notes is invaluable for retrospective analysis and decision support. The automatic extraction of data from unstructured notes, however, has been largely prevented by the complexity of handling abbreviations, misspellings, punctuation errors and other types of noise.

Objective

We present a robust system for cleaning noisy progress notes in real-time, with a focus on abbreviations and misspellings.

Methods

The system uses statistical semantic analysis based on Web data and the occasional participation of clinicians to automatically replace abbreviations with the actual senses and misspellings with the correct words.

Results

An accuracy as high as 88.73% was achieved based on statistical semantic analysis using Web data alone. With the caching mechanism enabled, the response time of the system is 1.5–2 s per word, which is about the same as the average typing speed of clinicians.

Conclusions

The overall accuracy and the response time of the system will improve with time, especially when the confidence mechanism is activated through clinicians’ interactions with the system. This system will be implemented in a clinical information system to drive interactive decision support and analysis functions leading to improved patient care and outcomes.

Introduction

Clinical records consist of a collection of clinical observations, treatment histories and medical test results produced at different stages of patients’ health care. Elements of clinical records include admission notes, progress notes, radiology reports, pathology reports and discharge summaries. Despite containing some coded information, most of the important details in clinical records remain locked away in unstructured text [1]. For this reason, the automatic analysis of clinical records has become an important research area with significance to a variety of systems for reducing medical errors and improving other aspects of health care such as disease surveillance [2].

Our research focuses specifically on the automated analysis of clinical progress notes to provide clinicians with intelligent functions to support their everyday decision making. The urge amongst clinicians for faster text entry while attempting to retain semantic clarity has, however, contributed to the noisy structure of progress notes [3]. A progress note is considered to contain noise when there is a potential difference or ambiguity between the surface form of the entered text and the intended content. For instance, a clinician could mistakenly enter “pb” instead of the intended “bp” or “blood pressure”. Similarly, the preferential use of acronyms such as “arf” over “acute renal failure”, which could potentially lead to ambiguous interpretations of notes, is another source of noise. The more noise clinicians introduce into their progress notes, the less intelligible the notes become. Common types of noise include abbreviations, misspellings and punctuation errors. From here onwards, abbreviations will be taken to simply mean shortened forms of words (whether common or ad hoc), and can comprise acronyms, initialisms and so on.

A quick look at the literature reveals that noise remains a major hurdle towards achieving accurate and robust progress note analysis for real-world applications. In the words of Friedlin et al. [4], “A chief barrier to accurate NLP classification was that many free text results disregarded grammatical rules. These reports frequently contained incomplete sentences, spelling errors, and/or lacked proper punctuation”. A missing punctuation mark or a few too many abbreviations can often render standard natural language processing tools such as part-of-speech taggers unusable. To date, most research into progress note analysis remains experimental, and deals with the cleaning phase (also known as normalising) by either ignoring it altogether [4], requiring the input data to be cleaned beforehand [5], or using machine learning techniques that are expensive to set up and impractical to deploy. The challenge of achieving complete and accurate progress note analysis, therefore, begins with a robust, practical approach dedicated to cleaning noisy input. The focus of this paper is on addressing this challenge.

Recently, it was suggested that the problem of abbreviation disambiguation is simpler than that of general term disambiguation [6]. A quick look at an actual progress note in Fig. 1 would immediately suggest otherwise. While expanding abbreviations may be relatively straightforward with respect to the biomedical literature [6], [7] or other properly curated texts [8], disambiguating abbreviations in clinical progress notes is definitely non-trivial. A number of problems contribute to the difficulty of achieving real-time performance in progress note analysis amongst existing systems. This paper focuses on the following: (1) nonconformity of the structure of progress notes to conventions, (2) noisy contextual information derived from progress notes, (3) inability of static dictionaries to easily reflect changes in the domain, and (4) problems with machine learning techniques. We discuss these points and other related work further in Section 2.

This paper addresses the four problems described above in the form of an integrated clinical progress note cleaning system based on statistical semantics and a novel confidence mechanism. The four main distinguishing points that separate the proposed system from existing ones are:

  • The system automatically corrects both abbreviations and misspellings.

  • The system is easy to set up without the trouble of training using hand-crafted data, or the complexity of preparing a variety of features (e.g. from static knowledge sources, part-of-speech tagging, etc.) for classification.

  • The system uses statistical information derived from the distributional behaviour of words (i.e. statistical semantics) on the World Wide Web (or the Web from here on) to generate a set of scores to automatically produce the preliminary corrections.

  • The system memorizes its interactions with clinicians over time to establish a confidence mechanism which is used to adjust the scores for improving future automatic corrections.
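The interplay between the last two points can be sketched as follows: candidate senses are scored by their Web co-occurrence with the note's context, and that score is reweighted by a confidence value accumulated from clinician feedback. This is an illustrative reconstruction only; the hit counts and confidence values below are made-up stand-ins, not the authors' actual data or formula.

```python
# Hypothetical Web hit counts for "<candidate sense> <context word>"
# queries (illustrative numbers, not real search results).
WEB_HITS = {
    ("acute renal failure", "creatinine"): 90_000,
    ("acute respiratory failure", "creatinine"): 12_000,
    ("acute rheumatic fever", "creatinine"): 1_500,
}

# Confidence: fraction of times clinicians have accepted each candidate
# so far; 0.5 acts as a neutral prior before any feedback is recorded.
CONFIDENCE = {
    "acute renal failure": 0.8,
    "acute respiratory failure": 0.5,
    "acute rheumatic fever": 0.5,
}

def score(candidate, context_words):
    """Web co-occurrence score, adjusted by clinician confidence."""
    hits = sum(WEB_HITS.get((candidate, w), 0) for w in context_words)
    return hits * CONFIDENCE[candidate]

def best_sense(candidates, context_words):
    """Pick the highest-scoring expansion for an ambiguous abbreviation."""
    return max(candidates, key=lambda c: score(c, context_words))

senses_of_arf = ["acute renal failure",
                 "acute respiratory failure",
                 "acute rheumatic fever"]
print(best_sense(senses_of_arf, ["creatinine"]))  # acute renal failure
```

As clinicians accept or reject corrections, the confidence values shift, so the same Web evidence can yield different preliminary corrections over time.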

In short, the system provides realistic, real-time operation in the face of noisy progress notes, with the ability to incrementally improve its performance through repeated interactions with clinicians. Section 3 describes this system in detail. Section 3.1 discusses the system's sources of long forms for abbreviations (senses from here on) and of possible replacements for misspellings (called suggestions). We evaluated the system's performance in two parts. First, we look at the response time of the system during real-time operation. Second, we assess the accuracy of the initial automatic corrections without the confidence mechanism. The results of the evaluation are presented in Section 4. Section 5 discusses the limitations of the proposed system. We conclude this paper in Section 6.
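A common way to generate suggestions for a misspelling, and the one sketched here, is to rank lexicon entries by Levenshtein edit distance to the noisy token; the tiny lexicon below is a hypothetical stand-in for the system's actual word sources, so this illustrates the general technique rather than the authors' exact method.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (Levenshtein, 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggestions(token, lexicon, max_distance=2):
    """Lexicon words within max_distance edits of the token, closest first."""
    ranked = sorted((levenshtein(token, w), w) for w in lexicon)
    return [w for d, w in ranked if d <= max_distance]

lexicon = ["intubated", "incubated", "ventilatory", "continues"]
# Both "incubated" and "intubated" lie within 2 edits of "itnubated";
# this is exactly why context-based scoring is needed to pick between them.
print(suggestions("itnubated", lexicon))
```

Note that edit distance alone cannot decide between near-tied suggestions, which motivates the statistical semantic scoring described in the bullet points above.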


Related work

Unlike properly curated texts, the noisy nature of clinical progress notes and the importance of complete and accurate analysis place an extra burden on the cleaning phase, hence the significance of this research. A look into the literature immediately reveals the many problems that contribute to the challenging nature of this task. We first take a look at some of the existing work on disambiguating abbreviations and correcting spelling errors in Section 2.2. We then discuss in detail in

Real-time clinical progress note cleaning

In essence, our system takes as input a noisy progress note, and produces the corresponding cleaned text as output. The system cleans a progress note in six steps. First, the system splits the input note by whitespace into a set of n words W = {w1, …, wn} (line 2 in Algorithm 1). For instance, the sentence “Pt continues itnubated and on ventilatory support” is broken down into W = {“Pt”, “continues”, “itnubated”, “and”, “on”, “ventilatory”, “support”}. Second, the system identifies words wi ∈ W that

Evaluation results

For the initial evaluation, we randomly selected a test set of 30 samples from a corpus of 2433 actual de-identified progress notes from http://physionet.org. The test set comprises 961 words, with each note containing an average of 32 words. Fig. 3 shows 4 of the 30 progress notes used in this evaluation. We performed the test in four steps. First, we typed the 30 progress notes word by

Limitations and discussions

The system is capable of removing on average 7 out of 10 noise words in a progress note. Considering the simplicity of the interface (e.g. no user involvement in setting system parameters) and the absence of the need for expertise to maintain complex knowledge bases, such performance is commendable. Due to the relatively small test set, care should be taken when interpreting these figures. The current performance only signifies a baseline which will improve with time through the participation

Conclusion

Clinical progress notes are a rich source of valuable data for improving the many facets of health care. The problem, however, lies in the noisy nature of such texts, and the inability of many of the present technologies to adequately process and analyse them. Existing work either ignores the noise, requires the input data to be cleaned beforehand, or uses techniques that are expensive to set up and impractical to deploy.

In this paper, we presented a robust system for cleaning clinical progress

References (30)

  • F. Damerau, A technique for computer detection and correction of spelling errors, Communications of the ACM (1964)
  • V. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady (1966)
  • R. Wagner et al., The string-to-string correction problem, Journal of the ACM (1974)
  • K. Kukich, Techniques for automatically correcting words in text, ACM Computing Surveys (1992)
  • M. Odell, R. Russell, U.S. patent number 1,261,167. U.S. Patent Office, Washington, DC;...