Constructing Arabic Reading Comprehension Datasets: Arabic WikiReading and KaifLematha

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

Neural machine reading comprehension models have gained immense popularity over the last decade, driven by the availability of large-scale English datasets. A key factor limiting the development and investigation of neural models for Arabic is the shortage of suitable datasets: those currently available are either too small to train deep neural models or were created by automatically translating English datasets, so the exact answer may not be found in the corresponding text. In this paper, we propose two high-quality, large-scale Arabic reading comprehension datasets, Arabic WikiReading and KaifLematha, with around 100 K instances. We followed two different methodologies to construct them. For KaifLematha, we employed crowdworkers to collect non-factoid questions from Wikipedia paragraphs. For Arabic WikiReading, we followed a distant supervision strategy, using the Wikidata knowledge base as ground truth. We carried out both quantitative and qualitative analyses to investigate the level of reasoning required to answer the questions in the proposed datasets. A competitive pre-trained language model attained F1 scores of 81.77 and 68.61 on the Arabic WikiReading and KaifLematha datasets, respectively, but struggled to extract precise answers for KaifLematha. Human performance reached an F1 score of 82.54 on the KaifLematha development set, which leaves ample room for improvement.
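The reported F1 scores are, in all likelihood, token-overlap F1 between a predicted answer span and the reference answers, in the style of SQuAD-like extractive evaluation; this is an assumption about the exact evaluation script, not a statement of it. The minimal Python sketch below illustrates how such a score is typically computed; `normalize`, `f1_score`, and `max_f1` are illustrative names only.

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation, and whitespace-tokenize. This is a simplification;
    # Arabic evaluation scripts may also strip diacritics and the definite article.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

def f1_score(prediction, ground_truth):
    # Token-overlap F1 between a predicted answer span and one reference answer.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def max_f1(prediction, references):
    # With several reference answers, the per-instance score is usually the
    # maximum over references.
    return max(f1_score(prediction, r) for r in references)
```

Taking the maximum over several references is consistent with note 5 below, which states that all three collected answers are treated as ground truth for model evaluation.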


Data availability

The datasets are available at https://github.com/esulaiman/Arabic-WikiReading-and-KaifLematha-datasets.

Code availability

Not applicable.

Notes

  1. https://github.com/motazsaad/arwikiExtracts

  2. https://github.com/attardi/wikiextractor

  3. https://github.com/xwhan/wikidata-filter

  4. https://www.wikidata.org/wiki/Special:ListDatatypes

  5. This explanation concerns the evaluation of human performance only; for model evaluation, all three collected answers are considered ground truth.

  6. https://github.com/google-research/bert

  7. https://github.com/aub-mind/arabert
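To make the distant-supervision strategy mentioned in the abstract concrete, the sketch below pairs an article's plain text (e.g. extracted with the tools in notes 1 and 2) with the Wikidata property labels and values of its entity (e.g. filtered with the tool in note 3), keeping a pair only when the value occurs verbatim in the text. This is a hypothetical illustration, not the authors' actual pipeline; `Instance` and `build_instances` are invented names.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    document: str   # plain-text Wikipedia article
    question: str   # Wikidata property label used as the query, e.g. "المهنة" (occupation)
    answer: str     # the property's value for the article's entity

def build_instances(article_text, triples):
    # triples: list of (property_label, value) pairs for the article's Wikidata entity.
    # Keep a pair only if the value string occurs verbatim in the article text,
    # so the ground-truth answer is guaranteed to be extractable.
    instances = []
    for prop_label, value in triples:
        if value and value in article_text:
            instances.append(Instance(article_text, prop_label, value))
    return instances
```

Filtering on verbatim occurrence is one simple way to guarantee that the answer can be found in the document, which is the property the abstract contrasts with automatically translated datasets.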


Acknowledgements

The authors would like to thank the Deanship of Scientific Research at King Saud University for funding and supporting this research through the DSR Graduate Students Research Support (GSR) initiative. The authors also thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support.

Funding

This research was funded by the Deanship of Scientific Research at King Saud University through the DSR Graduate Students Research Support (GSR) initiative.

Author information


Corresponding author

Correspondence to Eman Albilali.

Ethics declarations

Conflict of interest

All the authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Figs. 7 and 8.

Fig. 7 Human annotation interface, on which additional answers were collected for the KaifLematha development and test sets.

Fig. 8 Four examples from the Arabic WikiReading dataset. The first two examples share the same question المهنة (occupation, shown in blue) but have different paragraphs and answers (shown in red). The last two examples share the same question الاسم الأول (first name), again each with a different paragraph and answer.


About this article


Cite this article

Albilali, E., Al-Twairesh, N. & Hosny, M. Constructing Arabic Reading Comprehension Datasets: Arabic WikiReading and KaifLematha. Lang Resources & Evaluation 56, 729–764 (2022). https://doi.org/10.1007/s10579-022-09577-5
