Towards a Polish Question Answering Dataset (PoQuAD)

  • Conference paper
From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (ICADL 2022)

Abstract

This paper presents the efforts towards creating PoQuAD, a dataset for training automatic question answering models in Polish. It justifies why native data is vital for training accurate question answering systems. PoQuAD broadly follows the methodology of SQuAD 2.0 (including impossible questions), but departs from it in two respects. The first is a reduced annotation density, which allows a broader range of topics to be included. The second is the inclusion of a generative answer layer to better suit the needs of a morphologically rich language. PoQuAD is a work in progress and so far consists of over 29,000 question-answer pairs with contexts extracted from Polish Wikipedia. The planned size of the dataset is over 50,000 such entries. The paper describes the annotation process and the guidelines given to annotators to ensure the quality of the data. The collected data is analysed to shed some light on its linguistic properties and on the difficulty of the task.
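To make the two layers concrete, the following is a minimal sketch of how a SQuAD 2.0-style entry with an additional generative answer might be represented. The class `QAEntry`, the helper `span_is_consistent`, and the field name `generative_answer` are assumptions made for illustration, and the Polish sentence and questions are invented here rather than taken from PoQuAD. The extractive answer is a character span of the context (here in the locative case), while the generative answer gives the form that fits the question (nominative).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAEntry:
    """One question in a SQuAD 2.0-style layout (illustrative, not the official schema)."""
    question: str
    is_impossible: bool                      # SQuAD 2.0-style flag for unanswerable questions
    answer_text: Optional[str] = None        # extractive answer, copied verbatim from the context
    answer_start: Optional[int] = None       # character offset of the span in the context
    generative_answer: Optional[str] = None  # hypothetical field: answer in its natural form


def span_is_consistent(context: str, entry: QAEntry) -> bool:
    """Check that the extractive span really occurs at the stated offset."""
    if entry.is_impossible:
        return entry.answer_text is None
    end = entry.answer_start + len(entry.answer_text)
    return context[entry.answer_start:end] == entry.answer_text


# Invented example context: "Maria Skłodowska-Curie was born in Warsaw."
context = "Maria Skłodowska-Curie urodziła się w Warszawie."

entries = [
    QAEntry(
        question="Jakie miasto było miejscem urodzenia Marii Skłodowskiej-Curie?",
        is_impossible=False,
        answer_text="Warszawie",             # locative form, as it appears in the context
        answer_start=context.index("Warszawie"),
        generative_answer="Warszawa",        # nominative form that answers the question
    ),
    QAEntry(
        question="W którym roku Maria Skłodowska-Curie przeprowadziła się do Paryża?",
        is_impossible=True,                  # the context does not contain this information
    ),
]

all_consistent = all(span_is_consistent(context, e) for e in entries)
impossible_count = sum(e.is_impossible for e in entries)
```

In this layout a purely extractive model would return the inflected span, whereas the generative layer lets a model be trained or evaluated on the grammatically appropriate form.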

Notes

  1. https://github.com/ipipan/spacy-pl-trf

  2. The repository at https://github.com/ipipan/poquad will be continually updated with new data. It is licensed under the GNU GPL 3.0 license.

Acknowledgements

This work was supported by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme: (1) Intelligent travel search system based on natural language understanding algorithms, project no. POIR.01.01.01–00-0798/19; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Correspondence to Ryszard Tuora.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tuora, R., Zawadzka-Paluektau, N., Klamra, C., Zwierzchowska, A., Kobyliński, Ł. (2022). Towards a Polish Question Answering Dataset (PoQuAD). In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_16

  • DOI: https://doi.org/10.1007/978-3-031-21756-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21755-5

  • Online ISBN: 978-3-031-21756-2

  • eBook Packages: Computer Science, Computer Science (R0)
