Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain

  • Conference paper
Intelligent Systems (BRACIS 2023)

Abstract

The biomedical NLP community has seen great advances in dataset development, but mostly for the English language; other languages remain underrepresented, which has hindered progress in the field. This study introduces a Brazilian Portuguese dataset annotated for named entity recognition and relation extraction in the healthcare domain. We compiled and annotated a corpus of health professionals’ responses to frequently asked questions in online healthcare forums on diabetes. We measured inter-annotator agreement and conducted initial experiments on entity recognition and relation extraction using recent methods, including BERT-based models. Data, models, and results are publicly available at https://github.com/pavalucas/Bete.

Notes

  1. Translation into English: “Being overweight can lead to type 2 diabetes. Therefore, intermittent fasting may be a way to prevent type 2 diabetes. Intermittent fasting can also be used as a treatment for people newly diagnosed with type 1 diabetes who need to lose weight to achieve a more stable health condition; these people should be advised and monitored by an endocrinologist and a nutritionist.”

Author information

Corresponding author

Correspondence to Lucas Pavanelli.

Ethics declarations

Ethical Statement

Our study fully complies with ethical standards and did not require any submission to ethical boards, since no data collection with human subjects was carried out. Our dataset was created by our team and contains texts drafted by medical students under the supervision of healthcare professionals, all of whom are research members in our project.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pavanelli, L., Gumiel, Y.B., Ferreira, T., Pagano, A., Laber, E. (2023). Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_17

  • DOI: https://doi.org/10.1007/978-3-031-45392-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45391-5

  • Online ISBN: 978-3-031-45392-2

  • eBook Packages: Computer Science, Computer Science (R0)
