
WSDTN a Novel Dataset for Arabic Word Sense Disambiguation

  • Conference paper
Advances in Computational Collective Intelligence (ICCCI 2023)

Abstract

The word sense disambiguation (WSD) task aims to find the exact sense of an ambiguous word in a particular context. It is crucial for many applications, including machine translation, information retrieval, and semantic textual similarity. Arabic WSD faces significant challenges, primarily due to the scarcity of resources, which hinders the development of robust deep learning models. Additionally, the semantic sparsity of context further complicates the task, as Arabic words often exhibit multiple meanings. In this paper, we propose WSDTN, a manually annotated corpus designed to fill this gap and to enable the automatic disambiguation of Arabic words. It consists of 27,530 sentences collected from different resources and spanning different domains, each with a target word and its appropriate sense. We present the novel corpus itself, its creation procedure for reproducibility, and a transformer-based model used to disambiguate new words and to evaluate the corpus. The experimental results show that the baseline approach achieves an accuracy of around 90%. The corpus is publicly available upon request and is open for extension.
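To make the corpus layout and the reported metric concrete, the following is a minimal sketch, assuming each record pairs a sentence with a target word and a gold sense label, as the abstract describes. The names `WSDExample`, `accuracy`, and `most_frequent_sense`, the sense label strings, and the English toy sentences are hypothetical illustrations; they are not taken from WSDTN, and the predictor is a trivial stand-in rather than the transformer-based baseline of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WSDExample:
    """One hypothetical record: a sentence, its target word, and a gold sense."""
    sentence: str
    target_word: str
    sense: str  # assumed label format; the abstract does not specify it


def accuracy(examples: Sequence[WSDExample],
             predict: Callable[[str, str], str]) -> float:
    """Fraction of examples whose predicted sense equals the gold sense."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        predict(ex.sentence, ex.target_word) == ex.sense for ex in examples
    )
    return correct / len(examples)


# Toy English placeholders for illustration only; not WSDTN data.
toy_examples = [
    WSDExample("She sat on the bank of the river.", "bank", "riverside"),
    WSDExample("He deposited cash at the bank.", "bank", "financial_institution"),
    WSDExample("The bank approved the loan.", "bank", "financial_institution"),
    WSDExample("Fish gathered near the bank.", "bank", "riverside"),
]


def most_frequent_sense(sentence: str, target_word: str) -> str:
    """Stand-in predictor that ignores context; not the paper's model."""
    return "financial_institution"


toy_accuracy = accuracy(toy_examples, most_frequent_sense)
```

A context-blind predictor like this one can only match the share of the most common sense in the data, which is the kind of floor a context-aware model is meant to exceed.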




Corresponding author

Correspondence to Rakia Saidi.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Saidi, R., Jarray, F., Akacha, A., Aribi, W. (2023). WSDTN a Novel Dataset for Arabic Word Sense Disambiguation. In: Nguyen, N.T., et al. Advances in Computational Collective Intelligence. ICCCI 2023. Communications in Computer and Information Science, vol 1864. Springer, Cham. https://doi.org/10.1007/978-3-031-41774-0_16


  • DOI: https://doi.org/10.1007/978-3-031-41774-0_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41773-3

  • Online ISBN: 978-3-031-41774-0

  • eBook Packages: Computer Science, Computer Science (R0)
