
Data-Centric and Model-Centric Approaches for Biomedical Question Answering

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2022)

Abstract

Biomedical question answering (BioQA) is the automated extraction of information from the biomedical literature; as the number of accessible biomedical papers grows rapidly, BioQA is attracting increasing attention. To improve the performance of BioQA systems, we designed strategies for the sub-tasks of BioQA and assessed their effectiveness on the BioASQ dataset. We chose a data-centric or a model-centric strategy for each sub-task according to its potential for improvement. For example, model design for factoid-type questions has been explored intensively, but the potential of increased label consistency has not been investigated (data-centric approach). For list-type questions, in contrast, we apply a sequence-tagging model, which is more natural for the multi-answer (i.e. multi-label) task (model-centric approach).
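To make the sequence-tagging formulation concrete, here is a minimal sketch (not the authors' code; tokens and tags are invented examples) of how BIO-style tags over a passage yield multiple answers at once, which is what makes this model natural for list-type questions:

```python
# Illustrative sketch of decoding a BIO tag sequence into multiple answer
# spans, the core idea of a sequence-tagging model for list-type QA.
def decode_bio(tokens, tags):
    """Collect every contiguous B-I... span as one answer string."""
    answers, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                   # a new answer span starts here
            if current:
                answers.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:     # the current span continues
            current.append(token)
        else:                            # "O": token is outside any answer
            if current:
                answers.append(" ".join(current))
            current = []
    if current:
        answers.append(" ".join(current))
    return answers

# Hypothetical passage tokens with one tag per token:
tokens = ["erlotinib", ",", "gefitinib", "and", "afatinib"]
tags   = ["B",         "O", "B",         "O",   "B"]
print(decode_bio(tokens, tags))  # ['erlotinib', 'gefitinib', 'afatinib']
```

A span-extraction (start/end pointer) model must be run or thresholded repeatedly to produce a list, whereas a single tagging pass naturally emits any number of answers.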

Our experimental results suggest two main points: scarce resources such as BioQA datasets can benefit from data-centric approaches with relatively little effort, and a model design that reflects the characteristics of the data can improve system performance.
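As a hedged illustration of what a low-effort data-centric pass can look like (this function and its rules are assumptions for exposition, not the paper's actual cleaning operations, which are published in the authors' repository), even trivial normalisation can collapse inconsistently written gold labels into one:

```python
import re

def normalize_answer(ans: str) -> str:
    """Toy label-consistency pass: case-fold, collapse whitespace,
    drop a trailing period. Rules are illustrative only."""
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)  # collapse internal whitespace
    ans = ans.rstrip(".")           # drop trailing period
    return ans

# Three inconsistent spellings of the same gold answer:
gold = ["EGFR ", "egfr.", "  EGFR"]
print({normalize_answer(a) for a in gold})  # {'egfr'}
```

Inconsistent labels penalise a model for predicting a string that is correct but formatted differently, so cleaning them improves both training signal and evaluation.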

This paper focuses mainly on the application of our strategies to the BioASQ 8b dataset and on our participating systems in the 9th BioASQ challenge. Our submissions achieved competitive results, with top or near-top performance in the 9th challenge (Task b, Phase B).


Notes

  1.

    Resources for our data cleaning operations (our annotations) are available at https://github.com/dmis-lab/bioasq9b-dmis.

  2.

    https://github.com/myint/language-check.

  3.

    https://github.com/dmis-lab/bioasq8b.

  4.

    Last checked in May 2022.

  5.

    The official result (human evaluation) is on: http://participants-area.bioasq.org/results/9b/phaseB/.

References

  1. Medline PubMed Production Statistics. https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html. Accessed 19 June 2022

  2. Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/W19-1909, https://www.aclweb.org/anthology/W19-1909

  3. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620 (2019)


  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423

  5. Dror, R., Peled-Cohen, L., Shlomov, S., Reichart, R.: Statistical significance testing for natural language processing. Synthesis Lect. Hum. Lang. Technol. 13(2), 1–116 (2020)


  6. Falke, T., Ribeiro, L.F., Utama, P.A., Dagan, I., Gurevych, I.: Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2214–2220 (2019)


  7. Jeong, M., et al.: Transferability of natural language inference to biomedical question answering. arXiv preprint arXiv:2007.00217 (2020)

  8. Jin, Q., Dhingra, B., Cohen, W.W., Lu, X.: Probing biomedical embeddings from language models. arXiv preprint (2019)


  9. Kim, D., et al.: A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729–73740 (2019). https://doi.org/10.1109/ACCESS.2019.2920708


  10. Kim, N., et al.: Probing what different NLP tasks teach machines about function word comprehension. In: Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pp. 235–249. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/S19-1026, https://www.aclweb.org/anthology/S19-1026

  11. Krithara, A., Nentidis, A., Paliouras, G., Krallinger, M., Miranda, A.: BioASQ at CLEF2021: large-scale biomedical semantic indexing and question answering. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds.) ECIR 2021. LNCS, vol. 12657, pp. 624–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72240-1_73


  12. Kryściński, W., McCann, B., Xiong, C., Socher, R.: Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840 (2019)

  13. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)


  14. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension (2019)


  15. Mollá, D., Khanna, U., Galat, D., Nguyen, V., Rybinski, M.: Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions. arXiv preprint arXiv:2108.12189 (2021)

  16. Ng, A.Y.: A Chat with Andrew on MLOps: from model-centric to data-centric AI (2021). https://www.youtube.com/06-AZXmwHjo

  17. Ozyurt, I.B.: End-to-end biomedical question answering via bio-answerfinder and discriminative language representation models. In: CLEF (Working Notes) (2021)


  18. Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint (2019)


  19. Peters, M.E., et al.: Deep contextualized word representations (2018)


  20. Phang, J., Févry, T., Bowman, S.R.: Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks (2019)


  21. Tsatsaronis, G., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 1–28 (2015)


  22. Wiese, G., Weissenborn, D., Neves, M.: Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 281–289. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/K17-1029, https://www.aclweb.org/anthology/K17-1029

  23. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, New Orleans, June 2018. https://doi.org/10.18653/v1/N18-1101, https://www.aclweb.org/anthology/N18-1101

  24. Yoon, W., Jackson, R., Lagerberg, A., Kang, J.: Sequence tagging for biomedical extractive question answering. Bioinformatics (2022). https://doi.org/10.1093/bioinformatics/btac397

  25. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019. CCIS, vol. 1168, pp. 727–740. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43887-6_64


  26. Yoon, W., et al.: KU-DMIS at BioASQ 9: data-centric and model-centric approaches for biomedical question answering. In: CLEF (Working Notes), pp. 351–359 (2021)


  27. Zhang, Y., Han, J.C., Tsai, R.T.H.: NCU-IISR/AS-GIS: results of various pre-trained biomedical language models and linear regression model in BioASQ task 9b phase B. In: CEUR Workshop Proceedings (2021)


  28. Zhu, C., et al.: Enhancing factual consistency of abstractive summarization. arXiv preprint arXiv:2003.08612 (2020)


Acknowledgements

We express our gratitude to Dr. Jihye Kim and Dr. Sungjoon Park of Korea University for their invaluable insight into our systems’ output. This research was supported by the National Research Foundation of Korea (NRF-2020R1A2C3010638) and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).

Author information


Correspondence to Jaewoo Kang.


Ethics declarations

Author Note

This work was submitted to the 2022 CLEF Best of 2021 Labs track. It originates from our participation in the 9th BioASQ challenge (2021 CLEF Labs), presented under the title KU-DMIS at BioASQ 9: Data-centric and model-centric approaches for biomedical question answering (Yoon et al. 2021 [26]).


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yoon, W. et al. (2022). Data-Centric and Model-Centric Approaches for Biomedical Question Answering. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_16


  • DOI: https://doi.org/10.1007/978-3-031-13643-6_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13642-9

  • Online ISBN: 978-3-031-13643-6

  • eBook Packages: Computer Science, Computer Science (R0)
