Named Entity Recognition for De-identifying Real-World Health Records in Spanish

  • Conference paper
Computational Science – ICCS 2023 (ICCS 2023)

Abstract

Interest in Electronic Health Records (EHRs) as a source of information for clinical decision-making has grown and been renewed in recent years. In this context, the automatic de-identification of EHRs is an essential task, since dissociating them from personal data is a mandatory first step before their distribution. However, most previous studies on this subject have been conducted on English EHRs, owing to the limited availability of annotated corpora in other languages, such as Spanish. In this study, we address the automatic de-identification of medical documents in Spanish. A private corpus of 599 real-world clinical cases has been annotated with 8 different protected health information categories. We tackle the predictive problem as a named entity recognition task, developing two deep learning-based methodologies: a first strategy based on recurrent neural networks (RNNs) and an end-to-end approach based on transformers. Additionally, we have developed a data augmentation procedure to increase the number of texts used to train the models. The results show that transformers outperform RNNs on the de-identification of Spanish clinical data. In particular, the best performance was obtained by the XLM-RoBERTa large transformer, with strict-match micro-averaged values of 0.946 for precision, 0.954 for recall and 0.95 for F1-score, when trained on the augmented version of the corpus. The performance achieved by transformers in this study demonstrates the viability of applying these state-of-the-art models in real-world clinical scenarios.
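As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how strict-match micro-averaged precision, recall and F1-score are commonly computed for named entity recognition. It is not the authors' evaluation code. The `Entity` type, the function name `strict_micro_prf`, the document identifiers and the category labels (`NAME`, `DATE`, `LOCATION`) are all hypothetical; the abstract does not list the 8 annotated categories. Under strict matching, a predicted entity counts as correct only if its document, start offset, end offset and category all coincide with a gold annotation.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """A hypothetical entity annotation: character span plus category."""
    doc_id: str
    start: int
    end: int
    label: str


def strict_micro_prf(
    gold: Iterable[Entity], pred: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Strict-match micro-averaged precision, recall and F1.

    Counts are pooled over all documents and categories (micro-averaging).
    A prediction is a true positive only on an exact match of document,
    span boundaries and label (strict matching).
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (illustrative only, not taken from the paper's corpus).
gold = [
    Entity("doc1", 0, 12, "NAME"),
    Entity("doc1", 30, 40, "DATE"),
    Entity("doc2", 5, 14, "LOCATION"),
]
pred = [
    Entity("doc1", 0, 12, "NAME"),      # exact match
    Entity("doc1", 30, 38, "DATE"),     # wrong end offset: no credit
    Entity("doc2", 5, 14, "LOCATION"),  # exact match
    Entity("doc2", 20, 25, "NAME"),     # spurious prediction
]

precision, recall, f1 = strict_micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy example the partially overlapping `DATE` prediction earns no credit, which is what makes the strict criterion more demanding than overlap-based (relaxed) matching.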

Notes

  1. https://www.ine.es/inebmenu/indiceAZ.htm

Acknowledgements

The authors acknowledge the support from the Ministerio de Economía y Empresa (MINECO) through grant TIN2017-88728-C2-1-R, from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA-045, from the Malaga-Pfizer consortium for AI research in Cancer - MAPIC, and from the Instituto de Investigación Biomédica de Málaga - IBIMA (all including FEDER funds).

Author information

Corresponding author

Correspondence to Guillermo López-García.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

López-García, G. et al. (2023). Named Entity Recognition for De-identifying Real-World Health Records in Spanish. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_17

  • DOI: https://doi.org/10.1007/978-3-031-36024-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36023-7

  • Online ISBN: 978-3-031-36024-4

  • eBook Packages: Computer Science, Computer Science (R0)
