Challenges in Annotating a Treebank of Clinical Narratives in Brazilian Portuguese

de Oliveira, Lucas Ferro Antunes; Pagano, Adriana; e Oliveira, Lucas Emanuel Silva; Moro, Claudia

doi:10.1007/978-3-030-98305-5_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13208)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

Dependency parsing can enhance the performance of Named Entity Recognition (NER) models and can be leveraged to boost information extraction. NER is essential for processing clinical narratives, but dependency parsing models for Brazilian Portuguese are scarce, and scarcer still for clinical texts and their specificities. This paper reports on the development of a treebank of clinical narratives in Brazilian Portuguese and the drafting of annotation guidelines. Based on a corpus of 1,000 clinical narratives manually annotated with semantic information and split into 12,711 sentences, we identified characteristics of these texts that differ from traditional domains and have a deep impact on the annotation process: extensive use of acronyms and abbreviations, words not recognized by POS taggers, misspellings, special uses of some symbols, different uses of numerals, heterogeneous sentence lengths, and coordinated phrases without any punctuation. We developed a document that describes the annotation types and explains how difficult cases should be treated to ensure consistency, including examples that can be found in this kind of text. We created a tag-versus-frequency relation to substantiate some of the characteristics and challenges of the corpus. Once fully annotated, the corpus will be made available to the entire scientific community that conducts research with clinical texts.
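As an illustration of what a tag-versus-frequency relation can look like, the sketch below counts part-of-speech tags in a small annotated sample. It rests on assumptions that the abstract does not confirm: that the annotations are stored in the CoNLL-U format used by Universal Dependencies, and that the tags of interest are the UPOS column. The sample sentence and the function name `upos_frequencies` are invented for this example and are not taken from the corpus or the paper.

```python
from collections import Counter

# Hypothetical CoNLL-U fragment, invented for illustration (not from the corpus).
# It mimics an abbreviation-heavy clinical sentence: "Paciente com HAS e DM ."
SAMPLE = """\
# text = Paciente com HAS e DM .
1\tPaciente\tpaciente\tNOUN\t_\t_\t0\troot\t_\t_
2\tcom\tcom\tADP\t_\t_\t3\tcase\t_\t_
3\tHAS\tHAS\tNOUN\t_\t_\t1\tnmod\t_\t_
4\te\te\tCCONJ\t_\t_\t5\tcc\t_\t_
5\tDM\tDM\tNOUN\t_\t_\t3\tconj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""


def upos_frequencies(conllu_text):
    """Count UPOS tags (4th column) over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A well-formed CoNLL-U word line has exactly 10 tab-separated fields.
        if len(cols) != 10:
            continue
        token_id = cols[0]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        counts[cols[3]] += 1
    return counts


freqs = upos_frequencies(SAMPLE)
for tag, n in freqs.most_common():
    print(tag, n)
```

Run over a whole treebank, a table like this would make it easy to see, for instance, how often nouns or numerals occur relative to other tags; the same loop can count dependency relations instead by reading column index 7.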

Notes

1. The use of SemClinBr texts was approved by the Ethics Committee in Research (CEP) of PUCPR, under register no. 1,354,675.
2. https://universaldependencies.org/treebanks/pt_bosque/index.html
3. https://arboratorgrew.elizia.net/#/
4. https://universaldependencies.org

Author information

Authors and Affiliations

Pontifical Catholic University of Paraná, Curitiba, PR, Brazil
Lucas Ferro Antunes de Oliveira, Lucas Emanuel Silva e Oliveira & Claudia Moro
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Adriana Pagano

Corresponding author

Correspondence to Lucas Ferro Antunes de Oliveira.

Editor information

Editors and Affiliations

Universidade de Fortaleza, Fortaleza, Brazil
Vládia Pinheiro
CiTIUS - Universidade de Santiago de Compostela, Santiago de Compostela, Spain
Pablo Gamallo
Universidade Nova de Lisboa, Lisbon, Portugal
Raquel Amaro
University of Sheffield, Sheffield, UK
Carolina Scarton
INESC-ID, Lisbon, Portugal
Fernando Batista
Federal University of São Carlos, São Carlos, Brazil
Diego Silva
University of Lisbon, Lisbon, Portugal
Catarina Magro
Sentimonitor, Porto Alegre, Brazil
Hugo Pinto

About this paper

Cite this paper

de Oliveira, L.F.A., Pagano, A., e Oliveira, L.E.S., Moro, C. (2022). Challenges in Annotating a Treebank of Clinical Narratives in Brazilian Portuguese. In: Pinheiro, V., et al. Computational Processing of the Portuguese Language. PROPOR 2022. Lecture Notes in Computer Science, vol 13208. Springer, Cham. https://doi.org/10.1007/978-3-030-98305-5_9

DOI: https://doi.org/10.1007/978-3-030-98305-5_9
Published: 16 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98304-8
Online ISBN: 978-3-030-98305-5
eBook Packages: Computer Science, Computer Science (R0)
