Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain

Pavanelli, Lucas; Gumiel, Yohan Bonescki; Ferreira, Thiago; Pagano, Adriana; Laber, Eduardo

doi:10.1007/978-3-031-45392-2_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14197)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

The biomedical NLP community has seen great advances in dataset development, but mostly for the English language; other languages remain underrepresented, which has hindered progress in the field. This study introduces a Brazilian Portuguese dataset annotated for named entity recognition and relation extraction in the healthcare domain. We compiled and annotated a corpus of health professionals’ responses to frequently asked questions in online healthcare forums on diabetes. We measured inter-annotator agreement and conducted initial experiments using up-to-date methods, such as BERT-based ones, to recognize entities and extract relations. Data, models, and results are publicly available at https://github.com/pavalucas/Bete.

Notes

1. Translation into English: “Being overweight can lead to type 2 diabetes. Therefore, intermittent fasting may be a way to prevent type 2 diabetes. Intermittent fasting can also be used as a treatment for people newly diagnosed with type 1 diabetes who need to lose weight to achieve a more stable health condition; these people should be advised and monitored by an endocrinologist and a nutritionist.”

Author information

Authors and Affiliations

aiXplain Inc., Los Gatos, USA
Lucas Pavanelli & Thiago Ferreira
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Yohan Bonescki Gumiel & Adriana Pagano
Pontifícia Universidade Católica do Rio de Janeiro (PUC-RJ), Rio de Janeiro, Brazil
Eduardo Laber

Corresponding author

Correspondence to Lucas Pavanelli.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Murilo C. Naldi
Centro Universitario da FEI, São Bernardo do Campo, Brazil
Reinaldo A. C. Bianchi

Ethics declarations

Ethical Statement

Our study fully complies with ethical standards and did not require any submission to ethical boards, since no data collection with human subjects was carried out. Our dataset was created by our team and contains texts drafted by medical students under the supervision of healthcare professionals, all of whom are research members in our project.

About this paper

Cite this paper

Pavanelli, L., Gumiel, Y.B., Ferreira, T., Pagano, A., Laber, E. (2023). Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_17

DOI: https://doi.org/10.1007/978-3-031-45392-2_17
Published: 12 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science, Computer Science (R0)
