Abstract
Neural machine reading comprehension models have gained immense popularity over the last decade, driven by the availability of large-scale English datasets. For Arabic, a key factor limiting neural model development is the scarcity of suitable datasets: those currently available are either too small to train deep neural models or were created by automatically translating English datasets, in which case the exact answer may not appear in the corresponding text. In this paper, we propose two high-quality, large-scale Arabic reading comprehension datasets, Arabic WikiReading and KaifLematha, with more than 100 K instances. We followed two different construction methodologies. First, we employed crowdworkers to collect non-factoid questions from Wikipedia paragraphs. Second, we constructed Arabic WikiReading following a distant supervision strategy, using the Wikidata knowledge base as ground truth. We carried out both quantitative and qualitative analyses to investigate the level of reasoning required to answer the questions in the proposed datasets. A competitive pre-trained language model attained F1 scores of 81.77 and 68.61 on the Arabic WikiReading and KaifLematha datasets, respectively, but struggled to extract precise answers for KaifLematha. Human performance on the KaifLematha development set reached an F1 score of 82.54, which leaves ample room for improvement.
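The F1 scores reported above follow the token-overlap metric that is standard in extractive question answering evaluation (as in SQuAD). A minimal sketch of that computation, assuming whitespace tokenization; the function name is illustrative, not from the paper's released code:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer string."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When several gold answers are collected per question (as with the three crowdsourced answers mentioned in the Notes), the per-question score is typically the maximum F1 over the available references.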
Data availability
The datasets are available at https://github.com/esulaiman/Arabic-WikiReading-and-KaifLematha-datasets.
Code availability
Not applicable.
Notes
This explanation concerns the evaluation of human performance. For the model evaluation, however, all three collected answers are considered ground truth.
Acknowledgements
The authors would like to thank the Deanship of Scientific Research at King Saud University for funding and supporting this research through the DSR Graduate Students Research Support (GSR) initiative. The authors also thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support.
Funding
This research was funded by the Deanship of Scientific Research at King Saud University through the DSR Graduate Students Research Support (GSR) initiative.
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
See Figs. 7 and 8.
Fig. 7: Four examples from the Arabic WikiReading dataset. The first two examples share the same question, المهنة (occupation, shown in blue), with different paragraphs and answers (shown in red). The second two examples share the same question, الاسم الأول (first name), each with a different paragraph and answer.
About this article
Cite this article
Albilali, E., Al-Twairesh, N. & Hosny, M. Constructing Arabic Reading Comprehension Datasets: Arabic WikiReading and KaifLematha. Lang Resources & Evaluation 56, 729–764 (2022). https://doi.org/10.1007/s10579-022-09577-5