Abstract
Advances in dataset development for biomedical NLP have concentrated on English, leaving other languages underrepresented and slowing progress in the field. This study introduces a Brazilian Portuguese dataset annotated for named entity recognition and relation extraction in the healthcare domain. We compiled and annotated a corpus of health professionals’ responses to frequently asked questions in online healthcare forums on diabetes. We measured inter-annotator agreement and conducted initial experiments with current methods for entity recognition and relation extraction, including BERT-based models. Data, models, and results are publicly available at https://github.com/pavalucas/Bete.
Notes
1. Translation into English: “Being overweight can lead to type 2 diabetes. Therefore, intermittent fasting may be a way to prevent type 2 diabetes. Intermittent fasting can also be used as a treatment for people newly diagnosed with type 1 diabetes who need to lose weight to achieve a more stable health condition; these people should be advised and monitored by an endocrinologist and a nutritionist.”
Ethics declarations
Ethical Statement
Our study fully complies with ethical standards and did not require any submission to ethical boards, since no data collection with human subjects was carried out. Our dataset was created by our team and contains texts drafted by medical students under the supervision of healthcare professionals, all of whom are research members in our project.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pavanelli, L., Gumiel, Y.B., Ferreira, T., Pagano, A., Laber, E. (2023). Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_17
DOI: https://doi.org/10.1007/978-3-031-45392-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science, Computer Science (R0)