Towards a Polish Question Answering Dataset (PoQuAD)

Tuora, Ryszard; Zawadzka-Paluektau, Natalia; Klamra, Cezary; Zwierzchowska, Aleksandra; Kobyliński, Łukasz

doi:10.1007/978-3-031-21756-2_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13636)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

This paper presents work towards creating PoQuAD, a dataset for training automatic question answering models in Polish. It justifies why native data is vital for training accurate question answering systems. PoQuAD broadly follows the methodology of SQuAD 2.0 (including impossible questions), but departs from it in two respects. The first is a reduced annotation density, which broadens the range of topics included. The second is the inclusion of a generative answer layer to better suit the needs of a morphologically rich language. PoQuAD is a work in progress and so far consists of over 29,000 question-answer pairs with contexts extracted from Polish Wikipedia. The planned size of the dataset is over 50,000 such entries. The paper describes the annotation process and the guidelines given to annotators to ensure the quality of the data. The collected data is analysed to shed some light on its linguistic properties and on the difficulty of the task.
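To make the two answer layers concrete, the following is a minimal sketch of how one entry could be represented, assuming a SQuAD 2.0-like record. The `is_impossible` flag and the character-offset `answer_start` mirror the public SQuAD 2.0 JSON format; the `QAEntry` class and the `generative_answer` field name are hypothetical, and the example sentence is an invented illustration, not taken from PoQuAD. The actual PoQuAD schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAEntry:
    """One SQuAD 2.0-style question with an extra free-form answer layer."""
    question: str
    context: str
    is_impossible: bool                      # SQuAD 2.0 flag for unanswerable questions
    answer_start: Optional[int] = None       # character offset of the extractive span
    answer_text: Optional[str] = None        # extractive span, copied from the context
    generative_answer: Optional[str] = None  # hypothetical name for the generative layer

    def validate(self) -> None:
        """Raise ValueError if the extractive span is inconsistent with the context."""
        if self.is_impossible:
            if self.answer_text is not None:
                raise ValueError("impossible question must not carry an answer span")
            return
        if self.answer_start is None or self.answer_text is None:
            raise ValueError("answerable question needs an extractive span")
        end = self.answer_start + len(self.answer_text)
        if self.context[self.answer_start:end] != self.answer_text:
            raise ValueError("span does not match the context at the given offset")


# Invented illustration: the span is in the locative case ("Warszawie"),
# while a natural stand-alone answer uses the nominative ("Warszawa").
context = "Maria Skłodowska-Curie urodziła się w Warszawie."
example = QAEntry(
    question="Jakie miasto jest miejscem urodzenia Marii Skłodowskiej-Curie?",
    context=context,
    is_impossible=False,
    answer_start=context.index("Warszawie"),
    answer_text="Warszawie",
    generative_answer="Warszawa",
)
example.validate()
```

The illustration shows why a purely extractive span can be awkward in an inflected language: the text copied from the context keeps the case ending required by its original sentence, whereas the generative layer can hold the form that fits the question.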

Notes

1. https://github.com/ipipan/spacy-pl-trf.
2. The repository at https://github.com/ipipan/poquad will be continually updated with new data. It is licensed under the GNU GPL 3.0.

References

Ayoubi, S., Davoodeh, M.Y.: PersianQA: a dataset for Persian question answering. https://github.com/SajjjadAyobi/PersianQA (2021)
Borzymowski, H.: Polish QA model (2020), model trained on HuggingFace. https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-polish-squad2
Chrabrowa, A., et al.: Evaluation of transfer learning for Polish with a text-to-text model. arXiv preprint arXiv:2205.08808 (2022)
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. CoRR (2019). https://arxiv.org/abs/1911.02116
Cui, Y., et al.: A span-extraction dataset for Chinese machine reading comprehension. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5883–5889. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1600
Dadas, S.: Polish BART. https://github.com/sdadas/polish-nlp-resources#bart
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR (2018). https://arxiv.org/abs/1810.04805
d’Hoffschmidt, M., Belblidia, W., Brendlé, T., Heinrich, Q., Vidal, M.: FQuAD: French question answering dataset (2020). https://arxiv.org/abs/2002.06071
Efimov, P., Chertok, A., Boytsov, L., Braslavski, P.: SberQuAD – Russian reading comprehension dataset: description and analysis. In: Arampatzis, A., et al. (eds.) CLEF 2020. LNCS, vol. 12260, pp. 3–15. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58219-7_1
Lim, S., Kim, M., Lee, J.: KorQuAD1.0: Korean QA dataset for machine reading comprehension (2019). https://arxiv.org/abs/1909.07005
Macková, K., Straka, M.: Reading comprehension in Czech via machine translation and cross-lingual transfer (2020). https://arxiv.org/abs/2007.01667
Medved, M., Horak, A.: SQAD: Simple question answering database. In: RASLAN (2014)
Mroczkowski, R., Rybak, P., Wróblewska, A., Gawlik, I.: HerBERT: efficiently pretrained transformer-based language model for Polish. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pp. 1–10. Association for Computational Linguistics, Kyiv, Ukraine (2021). https://www.aclweb.org/anthology/2021.bsnlp-1.1
Möller, T., Risch, J., Pietsch, M.: GermanQuAD and GermanDPR: improving non-English question answering and passage retrieval (2021). https://arxiv.org/abs/2104.12741
Nguyen, K., Nguyen, V., Nguyen, A., Nguyen, N.: A Vietnamese dataset for evaluating machine reading comprehension. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 2595–2605. International Committee on Computational Linguistics, Barcelona, Spain (2020). https://doi.org/10.18653/v1/2020.coling-main.233
Ogrodniczuk, M., Przybyła, P.: PolEval 2021 task 4: question answering challenge (2021)
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: Unanswerable questions for SQuAD (2018). https://doi.org/10.48550/ARXIV.1806.03822
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Association for Computational Linguistics, Austin, Texas (2016). https://doi.org/10.18653/v1/D16-1264
Sabol, R., Medved’, M., Horák, A.: Czech question answering with extended SQAD v3.0 benchmark dataset. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019, pp. 99–108. Tribun EU, Brno (2019)
Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label studio: data labeling software (2020–2022). https://github.com/heartexlabs/label-studio
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. CoRR (2020). https://arxiv.org/abs/2010.11934
Šulganová, T., Marek, M., Horák, A.: Enlargement of the Czech question-answering dataset to SQAD v2.0. In: Proceedings of the Eleventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN, pp. 79–84. Brno (2017)

Acknowledgements

This work was supported by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme: (1) Intelligent travel search system based on natural language understanding algorithms, project no. POIR.01.01.01–00-0798/19; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Authors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warszawa, Poland
Ryszard Tuora, Natalia Zawadzka-Paluektau, Cezary Klamra, Aleksandra Zwierzchowska & Łukasz Kobyliński

Corresponding author

Correspondence to Ryszard Tuora.

Editor information

Editors and Affiliations

National Taiwan Normal University, Taipei, Taiwan
Yuen-Hsien Tseng
Doshisha University, Kyoto, Japan
Marie Katsurai
VNU University of Engineering and Technology, Hanoi, Vietnam
Hoa N. Nguyen

Cite this paper

Tuora, R., Zawadzka-Paluektau, N., Klamra, C., Zwierzchowska, A., Kobyliński, Ł. (2022). Towards a Polish Question Answering Dataset (PoQuAD). In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_16

DOI: https://doi.org/10.1007/978-3-031-21756-2_16
Published: 07 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21755-5
Online ISBN: 978-3-031-21756-2
eBook Packages: Computer Science, Computer Science (R0)
