Abstract
The Dutch Ministry of Social Affairs and Employment regularly has to explore the content of labour agreements. Studies on topics such as diversity and work flexibility are conducted on a regular basis by means of specialised questionnaires. We show that a relatively small domain-specific dataset suffices to train a state-of-the-art extractive question answering (QA) system to answer these questions automatically. This paper introduces a new dataset, Dutch SQuAD, obtained by machine-translating the original SQuAD v2.0 dataset from English to Dutch (made publicly available at https://gitlab.com/niels.rouws/dutch-squad-v2.0). Pre-training QA models on this general-domain machine-translated dataset improves their subsequent domain adaptation. In our experiments, we compare fine-tuning pre-trained Dutch and multilingual language models: BERTje, RobBERT, and mBERT. QA models that were first trained on the general-domain machine-translated QA dataset and then adapted to the Dutch labour agreement dataset outperform models fine-tuned directly on the in-domain documents. We also compare several ensemble learning techniques and show that they yield an additional performance gain on this task. Finally, we introduce a new string-based voting approach and show that it performs on par with a previously proposed approach.
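The string-based voting idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes that string-based voting means majority voting over the (normalized) answer strings produced by the ensemble members, and the example answers are hypothetical:

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so surface variants of the
    same answer span are counted as one candidate."""
    return " ".join(answer.lower().split())

def string_vote(candidate_answers: list[str]) -> str:
    """Return the answer string proposed by the most ensemble members.

    Ties are broken by first appearance, since Counter preserves
    insertion order.
    """
    counts = Counter(normalize(a) for a in candidate_answers)
    return counts.most_common(1)[0][0]

# Three hypothetical QA models answer the same question about a
# labour agreement; two agree on the same span, so that span wins.
preds = ["38 uur per week", "38 Uur per week", "40 uur"]
print(string_vote(preds))  # → "38 uur per week"
```

In contrast to logit-averaging ensembles, such a scheme needs only the final answer strings, so models with incompatible tokenizers (e.g. BERTje and mBERT) can be combined directly.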
References
Abadani, N., Mozafari, J., Fatemi, A., Nematbakhsh, M.A., Kazemi, A.: ParSQuAD: machine translated SQuAD dataset for Persian question answering. In: 2021 7th International Conference on Web Research (ICWR), pp. 163–168. IEEE (2021)
Aniol, A., Pietron, M., Duda, J.: Ensemble approach for natural language question answering problem. In: 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW), pp. 180–183. IEEE (2019)
Borzymowski, H.: henryk/bert-base-multilingual-cased-finetuned-dutch-squad2 · Hugging Face (2020). https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-dutch-squad2
Carrino, C.P., Costa-jussà, M.R., Fonollosa, J.A.R.: Automatic Spanish translation of the SQuAD dataset for multilingual question answering. arXiv preprint arXiv:1912.05200 (2019)
de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., Nissim, M.: BERTje: a Dutch BERT model. CoRR abs/1912.09582 (2019). http://arxiv.org/abs/1912.09582
Delobelle, P., Winters, T., Berendt, B.: RobBERT: a Dutch RoBERTa-based language model (2020)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
Hazen, T.J., Dhuliawala, S., Boies, D.: Towards domain adaptation from limited data for question answering using deep neural networks (2019)
Isotalo, L.: Generative question answering in a low-resource setting
Jeong, M., et al.: Transferability of natural language inference to biomedical question answering. arXiv preprint arXiv:2007.00217 (2020)
Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
Lee, K., Yoon, K., Park, S., Hwang, S.-W.: Semi-supervised training data generation for multilingual question answering. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019)
Lui, M., Baldwin, T.: langid.py: an off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Korea. Association for Computational Linguistics, pp. 25–30, July 2012. https://www.aclweb.org/anthology/P12-3005
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
Möller, T., Reina, A., Jayakumar, R., Pietsch, M.: COVID-QA: a question answering dataset for COVID-19. In: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020 (2020)
Pires, T., Schlinger, E., Garrette, D.: How multilingual is multilingual BERT? (2019)
Poerner, N., Waltinger, U., Schütze, H.: Inexpensive domain adaptation of pretrained language models: case studies on biomedical NER and COVID-19 QA. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, pp. 1482–1490, November 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.134, https://www.aclweb.org/anthology/2020.findings-emnlp.134
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD (2018)
Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works (2020)
Startup in Residence Intergov. Geautomatiseerde tekst-analyse cao’s | Startup in Residence Intergov (2020). https://intergov.startupinresidence.com/nl/szw/geautomatiseerde-tekst-analyse-cao/brief
Tsatsaronis, G., et al.: An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 1–28 (2015)
Vaswani, A., et al.: Attention is all you need (2017)
Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Association for Computational Linguistics, pp. 1112–1122, June 2018. https://doi.org/10.18653/v1/N18-1101, https://www.aclweb.org/anthology/N18-1101
Xu, Y., Qiu, X., Zhou, L., Huang, X.: Improving BERT fine-tuning via self-ensemble and self-distillation. arXiv preprint arXiv:2002.10345 (2020)
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Rouws, N.J., Vakulenko, S., Katrenko, S. (2022). Dutch SQuAD and Ensemble Learning for Question Answering from Labour Agreements. In: Leiva, L.A., Pruski, C., Markovich, R., Najjar, A., Schommer, C. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2021. Communications in Computer and Information Science, vol 1530. Springer, Cham. https://doi.org/10.1007/978-3-030-93842-0_9
DOI: https://doi.org/10.1007/978-3-030-93842-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93841-3
Online ISBN: 978-3-030-93842-0
eBook Packages: Computer Science (R0)