Named Entity Recognition for De-identifying Real-World Health Records in Spanish

López-García, Guillermo; Moreno-Barea, Francisco J.; Mesa, Héctor; Jerez, José M.; Ribelles, Nuria; Alba, Emilio; Veredas, Francisco J.

doi:10.1007/978-3-031-36024-4_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 10475)

Included in the following conference series: International Conference on Computational Science

Abstract

Interest in Electronic Health Records (EHRs) as a source of information for decision-making in clinical practice has grown and been renewed. In this context, the automatic de-identification of EHRs is an essential task, since dissociating them from personal data is a mandatory first step before their distribution. However, most previous studies on this subject have been conducted on English EHRs, owing to the limited availability of annotated corpora in other languages, such as Spanish. In this study, we address the automatic de-identification of medical documents in Spanish. A private corpus of 599 real-world clinical cases has been annotated with 8 different protected health information categories. We have tackled the predictive problem as a named entity recognition task, developing two deep learning-based methodologies: a first strategy based on recurrent neural networks (RNNs) and an end-to-end approach based on transformers. Additionally, we have developed a data augmentation procedure to increase the number of texts used to train the models. The results show that transformers outperform RNNs on the de-identification of Spanish clinical data. In particular, the best performance was obtained by the XLM-RoBERTa large transformer, with strict-match micro-averaged values of 0.946 for precision, 0.954 for recall and 0.95 for F1-score, when trained on the augmented version of the corpus. The performance achieved by transformers in this study demonstrates the viability of applying these state-of-the-art models in real-world clinical scenarios.
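As an illustration of the strict-match micro-averaged metrics reported above, the following is a minimal sketch and not the authors' evaluation code. It assumes that an entity counts as correct only when its document, character span and category all match exactly, and that counts are pooled over all documents and categories before precision, recall and F1 are computed. The function name, the tuple layout and the category labels are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (document id, start offset, end offset, category).
Entity = Tuple[str, int, int, str]


def strict_micro_prf(gold: Iterable[Entity],
                     pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Strict-match micro-averaged precision, recall and F1.

    A prediction is a true positive only if the same document, span
    boundaries and category appear in the gold annotations.
    """
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(pred)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data with hypothetical category labels (not the corpus categories).
gold = [
    ("doc1", 0, 10, "NAME"),
    ("doc1", 25, 35, "DATE"),
    ("doc2", 5, 14, "LOCATION"),
]
pred = [
    ("doc1", 0, 10, "NAME"),      # exact match
    ("doc1", 25, 33, "DATE"),     # wrong end offset: no credit under strict match
    ("doc2", 5, 14, "LOCATION"),  # exact match
    ("doc2", 40, 44, "ID"),       # spurious prediction
]

p, r, f = strict_micro_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Harmonic mean of the precision and recall reported in the abstract.
reported_f1 = 2 * 0.946 * 0.954 / (0.946 + 0.954)
print(f"F1 from reported precision and recall: {reported_f1:.3f}")
```

In the toy data the partially overlapping DATE prediction earns no credit, which is what distinguishes strict matching from relaxed or overlap-based matching. The last two lines only check that the harmonic mean of the reported precision and recall is consistent with the reported F1-score.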

Acknowledgements

The authors acknowledge the support from the Ministerio de Economía y Empresa (MINECO) through grant TIN2017-88728-C2-1-R, from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA-045, from the Malaga-Pfizer consortium for AI research in Cancer - MAPIC, and from the Instituto de Investigación Biomédica de Málaga - IBIMA (all including FEDER funds).

Author information

Authors and Affiliations

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain
Guillermo López-García, Francisco J. Moreno-Barea, Héctor Mesa, José M. Jerez & Francisco J. Veredas
Unidad de Gestión Clínica Intercentros de Oncología, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain
Nuria Ribelles & Emilio Alba
Research Institute of Multilingual Language Technologies, Universidad de Málaga, Málaga, Spain
Francisco J. Veredas

Corresponding author

Correspondence to Guillermo López-García.

Editor information

Editors and Affiliations

Czech Technical University in Prague, Prague, Czech Republic
Jiří Mikyška
University of Amsterdam, Amsterdam, The Netherlands
Clélia de Mulatier
AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M.A. Sloot

About this paper

Cite this paper

López-García, G. et al. (2023). Named Entity Recognition for De-identifying Real-World Health Records in Spanish. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_17

DOI: https://doi.org/10.1007/978-3-031-36024-4_17
Published: 26 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36023-7
Online ISBN: 978-3-031-36024-4
eBook Packages: Computer Science, Computer Science (R0)
