Pattern Recognition Letters

Volume 162, October 2022, Pages 47-55
Automatic dottization of Arabic text (Rasms) using deep recurrent neural networks

https://doi.org/10.1016/j.patrec.2022.09.001

Highlights

  • First work on automatically adding dots to Arabic text without dots (i.e., Rasms).

  • Automatic addition of dots using deep recurrent neural networks.

  • Evaluated on four different text corpora.

  • Character error rates ranging from 2.0% to 5.5% on independent test sets.

Abstract

Arabic letters in their early stages were only shapes (Rasm) without dots. Dots were added later to ease reading and reduce ambiguity. Thereafter, diacritics were introduced for phonetic guidance, mainly for nonnative speakers. Many studies have been conducted to automatically diacritize Arabic texts using machine learning techniques. However, to the best of our knowledge, automatically adding dots to Arabic Rasms has not been reported in the literature. In this work, we present the automatic addition of dots to Arabic Rasms using deep recurrent neural networks. Different design choices were explored, including the use of character sequences and word sequences as tokens. The presented techniques were evaluated on four diverse publicly available datasets. Character-level models with stacked BiGRU architecture outperformed all the other architectures with character error rates ranging from 2.0% to 5.5% and dottization error rates ranging from 4.2% to 11.0% on independent test sets.

Introduction

Languages and scripts evolve over time. In the early stages of the Arabic language, the alphabet characters were only shapes (called Rasm) without dots or diacritics. Dots were added to the characters to differentiate between similar-looking characters. Subsequently, diacritics were invented to resolve ambiguities and provide phonetic guidance. Nowadays, dots are a permanent part of the characters, unlike diacritics, which are used only in limited circumstances. Fig. 1 illustrates this idea with an example Arabic sentence written once using only the Rasms, once with the addition of dots, and once with the addition of diacritics along with dots. As a second example, Fig. 2 shows an image of a parchment bearing an ancient Quranic manuscript; it was written using Rasms and does not contain any dots.

A Rasm sequence without dots can represent several possible words. Fig. 3 presents two example Rasm sequences and the different Arabic words each can represent. For the first example (the top row in the figure), the Rasm sequence can represent six possible words. Similarly, for the second example, the presented Rasm sequence can represent five different words. The correct word depends on the context of the text and is difficult to identify instantly, even for a native speaker.
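To make this ambiguity concrete, the following Python sketch enumerates the candidate dotted words for a Rasm sequence by expanding each Rasm position into the characters it can stand for and filtering the combinations against a vocabulary. The inverse map shown is partial and the vocabulary is a hypothetical toy lexicon; a real system would use the full mapping of Fig. 5 and a corpus-derived word list.

    from itertools import product

    # Partial inverse map: each Rasm and the dotted characters it can stand for.
    CANDIDATES = {
        "ٮ": ["ب", "ت", "ث"],  # one "tooth" Rasm, three possible characters
        "ا": ["ا"],             # dotless Rasms map only to themselves
        "ل": ["ل"],
    }

    TOY_VOCAB = {"بال", "تال", "ثال"}  # hypothetical lexicon for illustration

    def candidate_words(rasm_seq: str) -> list[str]:
        """Expand every Rasm position and keep combinations that are words."""
        pools = [CANDIDATES[r] for r in rasm_seq]
        return ["".join(chars) for chars in product(*pools)
                if "".join(chars) in TOY_VOCAB]

    print(candidate_words("ٮال"))  # three candidates survive the vocabulary filter

Even with a vocabulary filter, several candidates typically remain, which is why context is needed to pick the correct word.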

Many studies have been performed to automate the diacritization process (Abandah et al. [1], Metwally et al. [2], Masmoudi et al. [3]). Moreover, Arabic diacritization is beneficial to other tasks in Arabic natural language processing (ANLP), such as translation (Diab et al. [4]). Since dots were introduced before diacritics and are more fundamental to the script, we were motivated to investigate the use of deep learning for the automatic dottization of Arabic texts. We believe that automatic dottization will contribute to future ANLP research both directly and indirectly. A direct use case is the automatic transcription of ancient Arabic manuscripts once their Rasms have been recognized. This research may also have implications for Arabic text recognition: some researchers have investigated splitting the task of Arabic text recognition into multiple stages, recognizing Arabic Rasms separately from dots and then combining the two to output the final Arabic text (Ahmad and Fink [5], [6]). Another use of this approach could be in the tokenization of Arabic texts for other NLP tasks. Additionally, a relatively newer use case is in the field of social media moderation, where users write Arabic text without dots to evade censorship. As an example, Fig. 4 shows an Arabic tweet using only Rasms.

To the best of our knowledge, automatically adding dots to Arabic Rasms has not been reported in the literature. In this paper, we present the automatic addition of dots to Arabic text using deep recurrent neural networks. We present two different approaches to dottization, one using word sequences as input and the other using character sequences as input. The presented techniques were evaluated on four publicly available Arabic text corpora.

It should be noted that although automatic Arabic diacritization is related to automatic Arabic dottization, the two problems differ. Dots are more fundamental in Arabic texts than diacritics: most Arabic texts are written without diacritics but not without dots. The challenges and opportunities involved in adding dots are different from those involved in adding diacritics. Although a character can take any of the possible Arabic diacritics, the sequences of diacritics that can occur over a word are very limited. Nouns, for example, have a fixed sequence of diacritics that does not change, except for the diacritic on the last character. Furthermore, the sequence of diacritics over a verb follows a certain template, such as fatha fatha fatha as in كَتَبَ (kataba, he wrote) or damma kasra fatha as in كُتِبَ (kutiba, it was written), and diacritics cannot appear in every possible combination. Additionally, many diacritics can appear only at specific positions in a word. For example, the sukun cannot come over the first character of a word. Similarly, tanwin diacritics can appear only over the last character of a word. Dots, on the other hand, are fundamental in defining the characters themselves and do not follow fixed templates as diacritics do. Additionally, some Arabic Rasms belong to a single character (like ا and ل) and do not have dots. Moreover, some word Rasms correspond to only one specific Arabic word, as no other Arabic word appears with the same Rasm sequence. This insight can help in the dottization task, as we discuss in the rest of the paper. Furthermore, homographs in Arabic can have different diacritics, but dottization does not face such issues. For example, the Arabic word ذهب can mean the noun ذَهَبٌ (gold) or the verb ذَهَبَ (he went), depending on the diacritic over the last character. On the other hand, many verb forms, such as يكتب (he writes), تكتب (she writes), and نكتب (we write), have the same Rasm sequence but different dots in the prefix, yet all the forms have the same diacritic sequence, fatha sukun damma fatha.

The rest of the paper is structured as follows: in Section 2, we present the background on the Arabic script. In Section 3, we present related work, focusing on Arabic diacritization. In Section 4, we present the methodology, including tokenization, model design, and postprocessing steps. In Section 5, we present the datasets used for experimentation, the evaluation metrics, the experiments and results, and a discussion of error analysis. Finally, in Section 6, we conclude with our findings and discuss possible extensions to our work.

Section snippets

Background on the Arabic script

Arabic is written from right to left and has 28 characters; unlike English, it does not have upper and lower cases. Some characters have no dots, while many have dots either above or below them. These 28 characters use 17 unique Rasms, as shown in Fig. 5. It is clear from the figure that several characters share the same Rasm. For example, the three characters (ب - ت - ث) are mapped to a single Rasm.
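As an illustration of this many-to-one relation, the sketch below removes dots by mapping each of the 28 characters to its Rasm class. The grouping used here is our reading of Fig. 5 (28 characters onto 17 Rasms in their isolated forms) and should be treated as an assumption rather than the paper's exact table.

    # Map each Arabic character to its (dotless) Rasm class; the grouping
    # below is an assumed reading of Fig. 5, not the paper's exact table.
    RASM_OF = {}
    for group, rasm in [
        ("بتث", "ٮ"), ("جحخ", "ح"), ("دذ", "د"), ("رز", "ر"),
        ("سش", "س"), ("صض", "ص"), ("طظ", "ط"), ("عغ", "ع"),
        ("فق", "ٯ"),  # fa and qaf share a dotless body in isolated form
        ("ا", "ا"), ("ك", "ك"), ("ل", "ل"), ("م", "م"),
        ("ن", "ں"), ("ه", "ه"), ("و", "و"), ("ي", "ى"),
    ]:
        for ch in group:
            RASM_OF[ch] = rasm

    def to_rasm(text: str) -> str:
        """Strip dots: replace every character with its Rasm class."""
        return "".join(RASM_OF.get(ch, ch) for ch in text)

    print(to_rasm("بيت"))  # -> "ٮىٮ": the dotted ب and ت collapse onto one Rasm

Dottization is the inverse of this function, which is what makes it nontrivial: to_rasm is many-to-one, so inverting it requires context.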

Furthermore, an Arabic character can have many

Related work

To the best of our knowledge, automatically adding dots to Arabic Rasms has not been reported in the literature. The closest research related to our topic is the automatic diacritization of Arabic texts. Consequently, we will briefly present some representative studies in the area of automatic diacritization of Arabic text in addition to some other relevant works on Arabic NLP.

Elshafei et al. [7] formulated the diacritization problem using hidden Markov models (HMMs), wherein the hidden state
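Although that work concerns diacritization, the HMM formulation carries over to dottization: the hidden states would be the dotted characters and the observations their Rasms, with Viterbi decoding recovering the most likely character sequence. The sketch below illustrates this under that assumption; the states cover a single Rasm and all probabilities are hypothetical placeholders, not estimates from data.

    import math

    STATES = ["ب", "ت", "ث"]               # dotted characters behind the Rasm ٮ
    INIT = {"ب": 0.5, "ت": 0.3, "ث": 0.2}  # hypothetical priors
    TRANS = {s: {t: 1.0 / 3 for t in STATES} for s in STATES}  # hypothetical

    def viterbi(n: int) -> list[str]:
        """Most likely dotted-character sequence for n occurrences of Rasm ٮ.
        Emission is deterministic here (each character emits its own Rasm),
        so the emission term drops out of the score."""
        best = {s: ([s], math.log(INIT[s])) for s in STATES}
        for _ in range(n - 1):
            step = {}
            for t in STATES:
                # Choose the predecessor path that maximizes the score of t.
                path, lp = max(best.values(),
                               key=lambda v: v[1] + math.log(TRANS[v[0][-1]][t]))
                step[t] = (path + [t], lp + math.log(TRANS[path[-1]][t]))
            best = step
        return max(best.values(), key=lambda v: v[1])[0]

    print(viterbi(2))  # -> ['ب', 'ب'] with these toy parameters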

Methodology

In this section, we describe our methodology for automatically adding dots to Arabic text using deep learning. We use recurrent neural networks (RNNs) because they have shown promising results in various NLP tasks, as evident from the literature. Since the input is a sentence without dots and the output is the same sentence with dots, we adopt the sequence-to-sequence (Seq2Seq) learning approach. Moreover, there is a direct one-to-one mapping between the input and the output. The input sequence
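As a minimal sketch of such a model, the PyTorch code below frames dottization as per-position sequence labeling with a stacked bidirectional GRU, the architecture family named in the abstract. All sizes (vocabulary, embedding, hidden, depth) are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn

    class BiGRUDottizer(nn.Module):
        def __init__(self, n_rasms=20, n_chars=30, emb=64, hidden=128, layers=2):
            super().__init__()
            self.embed = nn.Embedding(n_rasms, emb)      # Rasm id -> vector
            self.gru = nn.GRU(emb, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_chars)    # per-position logits

        def forward(self, rasm_ids):                     # (batch, seq_len)
            h, _ = self.gru(self.embed(rasm_ids))        # (batch, seq_len, 2*hidden)
            return self.out(h)                           # one dotted char per Rasm

    model = BiGRUDottizer()
    logits = model(torch.randint(0, 20, (8, 40)))        # a dummy batch
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 30),
                                 torch.randint(0, 30, (8 * 40,)))

Because input and output positions align one to one, no attention mechanism or autoregressive decoder is required; each output character is predicted directly from the BiGRU state at its position.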

Experiments and results

First, we present the datasets used for the experiments in addition to the preprocessing and tokenization carried out before feeding the data to our system. Next, we present the metrics used for system evaluation. This is followed by details on system training. Finally, we present the results and the discussion.
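For reference, the sketch below gives a standard character error rate (CER) computation: the Levenshtein distance from the hypothesis to the reference, normalized by the reference length. With the one-to-one mapping of this task it reduces to a per-position mismatch rate; the dottization error rate would presumably be computed analogously over dot-bearing positions only, which is an assumption on our part.

    def cer(hyp: str, ref: str) -> float:
        """Levenshtein distance between hyp and ref over the reference length."""
        d = list(range(len(hyp) + 1))      # edit-distance DP, one row at a time
        for j, r in enumerate(ref, 1):
            prev, d[0] = d[0], j
            for i, h in enumerate(hyp, 1):
                prev, d[i] = d[i], min(d[i] + 1,         # deletion
                                       d[i - 1] + 1,     # insertion
                                       prev + (h != r))  # substitution or match
        return d[len(hyp)] / len(ref)

    print(cer("ببت", "بيت"))  # one substituted character out of three -> 0.333...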

Conclusions

In this paper, we presented the automatic dottization of Arabic text. To the best of our knowledge, this is the first work addressing this topic. We used recurrent neural networks for the task and investigated two different approaches: one using words as tokens and the other using characters as tokens. The benefits and limitations of both approaches were discussed; overall, the character-level system outperformed the word-level system. A postprocessing step led to further small

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors would like to thank Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) for supporting this work under SDAIA-KFUPM Joint Research Center for Artificial Intelligence grant no. JRC-AI-RFP-06.

References (22)

  • M.S. Al-Shaibani et al.

    MetRec: a dataset for meter classification of Arabic poetry

    Data Brief

    (2020)
  • A. Stolcke

    SRILM - an extensible language modeling toolkit

    Seventh International Conference on Spoken Language Processing

    (2002)
  • T. Sembok et al.

    A rule and template based stemming algorithm for Arabic language

    Int. J. Math. Mod. Meth. Appl. Sci.

    (2011)
  • G.A. Abandah et al.

    Automatic diacritization of Arabic text using recurrent neural networks

    Int. J. Doc. Anal. Recognit. (IJDAR)

    (2015)
  • A.S. Metwally et al.

    A multi-layered approach for Arabic text diacritization

    2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)

    (2016)
  • A. Masmoudi et al.

    Automatic diacritization of Tunisian dialect text using recurrent neural network

    Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

    (2019)
  • M. Diab et al.

    Arabic diacritization in the context of statistical machine translation

    Proceedings of Machine Translation Summit XI: Papers, Copenhagen, Denmark

    (2007)
  • I. Ahmad et al.

    Multi-stage HMM based Arabic text recognition with rescoring

    2015 13th International Conference on Document Analysis and Recognition (ICDAR)

    (2015)
  • I. Ahmad et al.

    Handwritten Arabic text recognition using multi-stage sub-core-shape HMMs

    Int. J. Doc. Anal. Recognit. (IJDAR)

    (2019)
  • M. Elshafei et al.

    Statistical methods for automatic diacritization of Arabic text

    The Saudi 18th National Computer Conference, Riyadh

    (2006)
  • Y. Belinkov et al.

    Arabic diacritization with recurrent neural networks

    Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

    (2015)