
Data-Centric and Model-Centric Approaches for Biomedical Question Answering

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2022)

Abstract

Biomedical question answering (BioQA) is the automated extraction of information from the biomedical literature; as the number of accessible biomedical papers grows rapidly, BioQA is attracting increasing attention. To improve the performance of BioQA systems, we designed strategies for the sub-tasks of BioQA and assessed their effectiveness on the BioASQ dataset. We chose a data-centric or a model-centric strategy for each sub-task according to its potential for improvement. For example, model design for factoid-type questions has been explored intensively, but the potential of increased label consistency has not been investigated (data-centric approach). For list-type questions, in contrast, we apply a sequence-tagging model, which is more natural for the multi-answer (i.e. multi-label) task (model-centric approach).
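To make the sequence-tagging formulation concrete, here is a minimal sketch (not the authors' code; tokens and tags are invented examples) of how BIO-style tags over a passage yield multiple answers at once, which is what makes this model natural for list-type questions:

```python
# Illustrative sketch of decoding a BIO tag sequence into multiple answer
# spans, the core idea of a sequence-tagging model for list-type QA.
def decode_bio(tokens, tags):
    """Collect every contiguous B-I... span as one answer string."""
    answers, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                   # a new answer span starts here
            if current:
                answers.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:     # the current span continues
            current.append(token)
        else:                            # "O": token is outside any answer
            if current:
                answers.append(" ".join(current))
            current = []
    if current:
        answers.append(" ".join(current))
    return answers

# Hypothetical passage tokens with one tag per token:
tokens = ["erlotinib", ",", "gefitinib", "and", "afatinib"]
tags   = ["B",         "O", "B",         "O",   "B"]
print(decode_bio(tokens, tags))  # ['erlotinib', 'gefitinib', 'afatinib']
```

A span-extraction (start/end pointer) model must be run or thresholded repeatedly to produce a list, whereas a single tagging pass naturally emits any number of answers.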

Our experimental results suggest two main points: scarce resources such as BioQA datasets can benefit from data-centric approaches with relatively little effort, and a model design that reflects the characteristics of the data can improve system performance.
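As a hedged illustration of what a low-effort data-centric pass can look like (this function and its rules are assumptions for exposition, not the paper's actual cleaning operations, which are published in the authors' repository), even trivial normalisation can collapse inconsistently written gold labels into one:

```python
import re

def normalize_answer(ans: str) -> str:
    """Toy label-consistency pass: case-fold, collapse whitespace,
    drop a trailing period. Rules are illustrative only."""
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)  # collapse internal whitespace
    ans = ans.rstrip(".")           # drop trailing period
    return ans

# Three inconsistent spellings of the same gold answer:
gold = ["EGFR ", "egfr.", "  EGFR"]
print({normalize_answer(a) for a in gold})  # {'egfr'}
```

Inconsistent labels penalise a model for predicting a string that is correct but formatted differently, so cleaning them improves both training signal and evaluation.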

This paper focuses mainly on the application of our strategies to the BioASQ 8b dataset and on our participating systems in the 9th BioASQ challenge. Our submissions achieved competitive results, with top or near-top performance in the 9th challenge (Task b, Phase B).


Notes

  1.

    Resources for our data cleaning operations (our annotations) are available at https://github.com/dmis-lab/bioasq9b-dmis.

  2.

    https://github.com/myint/language-check.

  3.

    https://github.com/dmis-lab/bioasq8b.

  4.

    Last checked in May 2022.

  5.

    The official result (human evaluation) is on: http://participants-area.bioasq.org/results/9b/phaseB/.

References

  1. Medline PubMed Production Statistics. https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html. Accessed 19 June 2022

  2. Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/W19-1909, https://www.aclweb.org/anthology/W19-1909

  3. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620 (2019)


  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423

  5. Dror, R., Peled-Cohen, L., Shlomov, S., Reichart, R.: Statistical significance testing for natural language processing. Synthesis Lect. Hum. Lang. Technol. 13(2), 1–116 (2020)


  6. Falke, T., Ribeiro, L.F., Utama, P.A., Dagan, I., Gurevych, I.: Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2214–2220 (2019)


  7. Jeong, M., et al.: Transferability of natural language inference to biomedical question answering. arXiv preprint arXiv:2007.00217 (2020)

  8. Jin, Q., Dhingra, B., Cohen, W.W., Lu, X.: Probing biomedical embeddings from language models. arXiv preprint (2019)


  9. Kim, D., et al.: A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729–73740 (2019). https://doi.org/10.1109/ACCESS.2019.2920708


  10. Kim, N., et al.: Probing what different NLP tasks teach machines about function word comprehension. In: Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pp. 235–249. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/S19-1026, https://www.aclweb.org/anthology/S19-1026

  11. Krithara, A., Nentidis, A., Paliouras, G., Krallinger, M., Miranda, A.: BioASQ at CLEF2021: large-scale biomedical semantic indexing and question answering. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds.) ECIR 2021. LNCS, vol. 12657, pp. 624–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72240-1_73


  12. Kryściński, W., McCann, B., Xiong, C., Socher, R.: Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840 (2019)

  13. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)


  14. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension (2019)


  15. Mollá, D., Khanna, U., Galat, D., Nguyen, V., Rybinski, M.: Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions. arXiv preprint arXiv:2108.12189 (2021)

  16. Ng, A.Y.: A Chat with Andrew on MLOps: from model-centric to data-centric AI (2021). https://www.youtube.com/06-AZXmwHjo

  17. Ozyurt, I.B.: End-to-end biomedical question answering via bio-answerfinder and discriminative language representation models. In: CLEF (Working Notes) (2021)


  18. Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint (2019)


  19. Peters, M.E., et al.: Deep contextualized word representations (2018)


  20. Phang, J., Févry, T., Bowman, S.R.: Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks (2019)


  21. Tsatsaronis, G., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 1–28 (2015)


  22. Wiese, G., Weissenborn, D., Neves, M.: Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 281–289. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/K17-1029, https://www.aclweb.org/anthology/K17-1029

  23. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, New Orleans, June 2018. https://doi.org/10.18653/v1/N18-1101, https://www.aclweb.org/anthology/N18-1101

  24. Yoon, W., Jackson, R., Lagerberg, A., Kang, J.: Sequence tagging for biomedical extractive question answering. Bioinformatics (2022). https://doi.org/10.1093/bioinformatics/btac397

  25. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019. CCIS, vol. 1168, pp. 727–740. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43887-6_64


  26. Yoon, W., et al.: KU-DMIS at BioASQ 9: data-centric and model-centric approaches for biomedical question answering. In: CLEF (Working Notes), pp. 351–359 (2021)


  27. Zhang, Y., Han, J.C., Tsai, R.T.H.: NCU-IISR/AS-GIS: results of various pre-trained biomedical language models and linear regression model in BioASQ task 9b phase B. In: CEUR Workshop Proceedings (2021)


  28. Zhu, C., et al.: Enhancing factual consistency of abstractive summarization. arXiv preprint arXiv:2003.08612 (2020)


Acknowledgements

We express our gratitude to Dr. Jihye Kim and Dr. Sungjoon Park of Korea University for their invaluable insight into our systems’ output. This research was supported by the National Research Foundation of Korea (NRF-2020R1A2C3010638) and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).

Author information


Correspondence to Jaewoo Kang.


Ethics declarations

Author Note

This work was submitted to the 2022 CLEF Best of 2021 Labs track. It originates from our participation in the 9th BioASQ challenge (2021 CLEF Labs), presented under the title KU-DMIS at BioASQ 9: Data-centric and model-centric approaches for biomedical question answering (Yoon et al. 2021 [26]).


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yoon, W. et al. (2022). Data-Centric and Model-Centric Approaches for Biomedical Question Answering. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_16


  • DOI: https://doi.org/10.1007/978-3-031-13643-6_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13642-9

  • Online ISBN: 978-3-031-13643-6

  • eBook Packages: Computer Science, Computer Science (R0)
