Linguistic Uncertainty in Clinical NLP: A Taxonomy, Dataset and Approach

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2021)

Abstract

Linguistic uncertainty is prevalent in electronic health records (EHRs). The ability to handle and preserve uncertainty in natural language is an essential skill for clinicians, facilitating decidability and effective clinical reasoning processes despite incomplete knowledge in some situations. This has been addressed by previous research in clinical NLP by the development of algorithms that detect uncertainty expressions. However, existing rule-based algorithms have limited uncertainty detection capabilities. Therefore, we seek to reformulate uncertainty detection as a supervised machine learning problem by (i) reevaluating the concept of uncertainty, (ii) embedding this understanding in an improved linguistic uncertainty taxonomy and (iii) introducing a new dataset of EHRs annotated for nine types of uncertainty – the first publicly available dataset of its kind. Many of our classes are novel and emphasise implicit uncertainties – a form of uncertainty that is ignored by existing algorithms, yet has crucial functions in clinical settings. Through an evaluation of our dataset, we demonstrate the scalability of our approach and its utility in relation to research on clinical information extraction.

Notes

  1. All examples hereinafter are from MIMIC and are paraphrased.

  2. https://physionet.org/.

References

  1. Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform. 34, 301–310 (2001). https://doi.org/10.1006/jbin.2001.1029

  2. Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000). https://doi.org/10.1161/01.cir.101.23.e215

  3. Irvin, J., et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019). https://doi.org/10.1609/aaai.v33i01.3301590

  4. Johnson, A.E., Pollard, T.J., Mark, R.G.: MIMIC-III clinical database (version 1.4). PhysioNet (2016)

  5. Johnson, A.E., Pollard, T.J., Mark, R.G., Seth, B., Horng, S.: MIMIC-CXR Database (version 2.0.0). PhysioNet (2019)

  6. Johnson, A.E., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3, 1–9 (2016). https://doi.org/10.1038/sdata.2016.35

  7. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (2014). https://doi.org/10.3115/v1/d14-1181

  8. Mowery, D.L., Ave, M., Chapman, W.W.: Medical diagnosis lost in translation – analysis of uncertainty and negation expressions in English and Swedish clinical texts. In: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012) (2012)

  9. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  10. Peng, Y., Wang, X., Lu, L., Bagheri, M., Summers, R., Lu, Z.: NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. In: AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science (2018)

  11. Velupillai, S.: Shades of certainty: annotation and classification of Swedish medical records (2012). http://su.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:512263

  12. Vincze, V., Szarvas, G., Farkas, R., Móra, G., Csirik, J.: The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics 9, 1–9 (2008). https://doi.org/10.1186/1471-2105-9-S11-S9

Author information

Corresponding author

Correspondence to Mark Turner.

Ethics declarations

Ethics

This study has been carried out in accordance with all relevant guidelines and regulations for the use of MIMIC-III data. The sole aim of this paper, and of the way we handle data in our dataset, is to assist human medical experts in making better decisions in complex environments. Further, all annotators involved in the construction of our dataset were volunteers. Before deployment in an actual clinical setting, we plan to systematically evaluate our methodology under the supervision of expert clinicians.

A Model Implementation

Following the work of Kim [7], a state-of-the-art single-channel convolutional neural network (CNN) for sentence classification was used as a binary classifier.
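The general shape of such a classifier can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions, not the implementation used in the paper: the function names, the filter count, the embedding dimension, the random weights and the sigmoid output layer are all hypothetical, and training and dropout are omitted.

```python
import numpy as np


def conv_max_pool(embedded, filters, bias):
    """Convolve over token windows, apply ReLU, then max-over-time pooling.

    embedded: (seq_len, emb_dim) sentence matrix, with seq_len >= window
    filters:  (n_filters, window, emb_dim)
    bias:     (n_filters,)
    """
    window = filters.shape[1]
    n_windows = embedded.shape[0] - window + 1
    # Every contiguous span of `window` tokens: (n_windows, window, emb_dim)
    windows = np.stack([embedded[i:i + window] for i in range(n_windows)])
    # One activation per window and filter: (n_windows, n_filters)
    feature_maps = np.tensordot(windows, filters, axes=([1, 2], [1, 2])) + bias
    # ReLU, then keep the strongest response of each filter over the sentence
    return np.maximum(feature_maps, 0.0).max(axis=0)


def cnn_forward(embedded, conv_params, w_out, b_out):
    """Return the positive-class probability for one embedded sentence."""
    pooled = np.concatenate([conv_max_pool(embedded, f, b) for f, b in conv_params])
    logit = pooled @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))


# Toy input: 12 tokens with 8-dimensional embeddings (illustrative sizes only).
rng = np.random.default_rng(0)
embedded = rng.normal(size=(12, 8))
# Two convolutional layers with hypothetical window sizes 3 and 5, 4 filters each.
conv_params = [
    (rng.normal(scale=0.1, size=(4, w, 8)), np.zeros(4)) for w in (3, 5)
]
w_out = rng.normal(scale=0.1, size=8)
prob = cnn_forward(embedded, conv_params, w_out, 0.0)
print(f"P(positive class) = {prob:.3f}")
```

Max-over-time pooling makes the pooled vector independent of sentence length, so sentences of different lengths can share a single output layer.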

For our experiments (see Sects. 5.2 and 5.3), most hyperparameters were held constant: the learning rate was 0.3, the dropout probability in the dropout layer was 0.1, and the BioWordVec embeddings were scaled by a factor of 0.65. The window sizes of our two convolutional layers were either 1 and 3 or 3 and 5, and the number of training epochs ranged from 30 to 70. These hyperparameters were determined by monitoring the training loss. The random classifiers used as baselines were drawn from the scikit-learn library [9].
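The reported settings can be collected in a small configuration object, as in the sketch below. The names `CNNConfig`, `WINDOW_OPTIONS`, `EPOCH_OPTIONS` and `random_baseline` are hypothetical. The individual epoch values are placeholders, since only the range of 30 to 70 is reported. The baseline function is a standard-library stand-in for a random classifier: the text does not say which scikit-learn class was used.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CNNConfig:
    window_sizes: tuple             # window sizes of the two convolutional layers
    epochs: int                     # varied between experiments
    learning_rate: float = 0.3      # held constant
    dropout: float = 0.1            # held constant
    embedding_scale: float = 0.65   # factor applied to BioWordVec embeddings


WINDOW_OPTIONS = [(1, 3), (3, 5)]   # the two reported pairs
EPOCH_OPTIONS = [30, 50, 70]        # placeholder values within the reported range

configs = [
    CNNConfig(window_sizes=w, epochs=e)
    for w, e in product(WINDOW_OPTIONS, EPOCH_OPTIONS)
]


def random_baseline(train_labels, n_predictions, stratified=False, seed=0):
    """Predict binary labels at random.

    With stratified=True the positive rate matches the training labels;
    otherwise both classes are equally likely.
    """
    rng = random.Random(seed)
    positive_rate = sum(train_labels) / len(train_labels) if stratified else 0.5
    return [1 if rng.random() < positive_rate else 0 for _ in range(n_predictions)]


predictions = random_baseline([0, 1, 1, 0, 0], n_predictions=10, stratified=True)
print(len(configs), predictions)
```

Freezing the dataclass keeps the constant hyperparameters from being changed by accident between runs, so only the window sizes and epoch count differ across configurations.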

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Turner, M., Ive, J., Velupillai, S. (2021). Linguistic Uncertainty in Clinical NLP: A Taxonomy, Dataset and Approach. In: Candan, K.S., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2021. Lecture Notes in Computer Science, vol 12880. Springer, Cham. https://doi.org/10.1007/978-3-030-85251-1_11

  • DOI: https://doi.org/10.1007/978-3-030-85251-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85250-4

  • Online ISBN: 978-3-030-85251-1

  • eBook Packages: Computer Science (R0)
