The Impact of Arabic Diacritization on Word Embeddings

Published: 16 June 2023

Abstract

Word embeddings represent words as vectors for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed substantially to the rapid progress of the field in the last few years. In Arabic, diacritic marks are vital to the readability and comprehensibility of the language, yet current Arabic word embeddings are non-diacritized. In this article, we develop and compare word embedding models trained on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words, morphological semantic analysis, part-of-speech (POS) tagging, and semantic analysis. To support this evaluation, we created three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model is better at capturing syntactic and semantic relations and at clustering words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. Two further findings stand out: in the morphological semantic analysis, the advantage of the diacritized model grows as the number of target words increases, and diacritic marks matter more in POS tagging than in the other tasks.
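
To make the contrast between diacritized and non-diacritized text concrete, the following minimal Python sketch derives non-diacritized spellings from diacritized tokens and lists the spellings that become ambiguous. It is an illustration only, not the preprocessing used in the article: the Unicode range, the helper names `strip_diacritics` and `collapsed_forms`, and the toy word list are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: "diacritics" here means the tanwin/harakat/shadda/sukun block
# U+064B..U+0652 plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks listed above removed."""
    return DIACRITICS.sub("", text)


def collapsed_forms(tokens):
    """Group diacritized tokens that share one non-diacritized spelling."""
    groups = defaultdict(set)
    for token in tokens:
        groups[strip_diacritics(token)].add(token)
    # Keep only spellings that more than one diacritized form maps to.
    return {bare: forms for bare, forms in groups.items() if len(forms) > 1}


# Hypothetical toy vocabulary, written with escapes so the marks are explicit.
tokens = [
    "\u0639\u064E\u0644\u0650\u0645\u064E",  # 'alima, "he knew"
    "\u0639\u0650\u0644\u0652\u0645",        # 'ilm, "knowledge"
    "\u0639\u064E\u0644\u064E\u0645",        # 'alam, "flag"
    "\u0643\u0650\u062A\u064E\u0627\u0628",  # kitab, "book"
]
ambiguous = collapsed_forms(tokens)
for bare, forms in ambiguous.items():
    print(bare, "<-", sorted(forms))
```

In this toy vocabulary the first three diacritized forms share one bare spelling, so an embedding trained on non-diacritized text would assign them a single vector, while one trained on diacritized text could keep them apart.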

Cited By

  • (2025) Unlocking the power of transfer learning with Ad-Dabit-Al-Lughawi: A token classification approach for enhanced Arabic Text Diacritization. Expert Systems with Applications 269 (126166). DOI: 10.1016/j.eswa.2024.126166. Online publication date: Apr-2025
  • (2024) Toward Robust Arabic AI-Generated Text Detection: Tackling Diacritics Challenges. Information 15:7 (419). DOI: 10.3390/info15070419. Online publication date: 19-Jul-2024
  • (2023) AIRABIC: Arabic Dataset for Performance Evaluation of AI Detectors. 2023 International Conference on Machine Learning and Applications (ICMLA) (864-870). DOI: 10.1109/ICMLA58977.2023.00127. Online publication date: 15-Dec-2023

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
    June 2023
    635 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3604597

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 June 2023
    Online AM: 19 April 2023
    Accepted: 30 March 2023
    Revised: 03 February 2023
    Received: 07 June 2022
    Published in TALLIP Volume 22, Issue 6

    Author Tags

    1. Arabic NLP
    2. word embeddings
    3. diacritization
    4. morphological semantics
    5. semantic analysis

    Qualifiers

    • Research-article
