
Dutch SQuAD and Ensemble Learning for Question Answering from Labour Agreements

  • Conference paper
Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1530)

Abstract

The Dutch Ministry of Social Affairs and Employment regularly has to explore the content of labour agreements. Studies on topics such as diversity and work flexibility are conducted on a regular basis by means of specialised questionnaires. We show that a relatively small domain-specific dataset suffices to train a state-of-the-art extractive question answering (QA) system to answer these questions automatically. This paper introduces a new dataset, Dutch SQuAD, obtained by machine-translating the original SQuAD v2.0 dataset from English to Dutch (made publicly available at https://gitlab.com/niels.rouws/dutch-squad-v2.0). Pre-training QA models on this general-domain machine-translated dataset improves their subsequent domain adaptation. In our experiments, we compare fine-tuning pre-trained Dutch and multilingual language models: BERTje, RobBERT, and mBERT. Our results demonstrate that QA models first trained on the general-domain machine-translated QA dataset and then adapted to the Dutch labour agreement dataset outperform models fine-tuned directly on the in-domain documents. We also compare several ensemble learning techniques and show that they yield an additional performance gain on this task. Finally, we introduce a new string-based voting approach and show that it performs on par with a previously proposed approach.
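The abstract does not detail how the string-based voting works. As a minimal sketch of one plausible formulation (the scoring rule, function names, and example answer strings below are assumptions for illustration, not the authors' exact method), each ensemble member's answer string can be scored by its longest-common-substring overlap with the other candidates, and the highest-scoring candidate returned:

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common substring of a and b (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def string_vote(candidates: list[str]) -> str:
    """Return the candidate with the most substring overlap with the rest."""
    def score(i: int) -> int:
        return sum(longest_common_substring(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=score)]

# Two ensemble members agree on a longer span, one returns a shorter variant.
answers = ["artikel 12 lid 3", "artikel 12 lid 3", "in artikel 12"]
print(string_vote(answers))  # artikel 12 lid 3
```

Unlike exact-match majority voting, substring overlap still rewards answers that agree on the core span even when span boundaries differ slightly between models.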


Notes

  1. https://gitlab.com/niels.rouws/dutch-squad-v2.0
  2. https://github.com/borhenryk/train_custom_qa_model
  3. https://github.com/saffsd/langid.py
  4. https://huggingface.co/GroNLP/bert-base-dutch-cased
  5. https://huggingface.co/pdelobelle/robbert-v2-dutch-base
  6. https://huggingface.co/bert-base-multilingual-cased
  7. https://www.geeksforgeeks.org/longest-common-substring-dp-29/


Author information

Corresponding author: Niels J. Rouws.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Rouws, N.J., Vakulenko, S., Katrenko, S. (2022). Dutch SQuAD and Ensemble Learning for Question Answering from Labour Agreements. In: Leiva, L.A., Pruski, C., Markovich, R., Najjar, A., Schommer, C. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2021. Communications in Computer and Information Science, vol 1530. Springer, Cham. https://doi.org/10.1007/978-3-030-93842-0_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-93842-0_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93841-3

  • Online ISBN: 978-3-030-93842-0

  • eBook Packages: Computer Science, Computer Science (R0)
