Automatic dottization of Arabic text (Rasms) using deep recurrent neural networks☆
Introduction
Languages and scripts evolve over time. In the early stages of the Arabic language, the alphabet characters were only shapes (called Rasms) without dots or diacritics. Dots were added to the characters to differentiate between similar-looking characters. Subsequently, diacritics were invented; they are used to resolve ambiguities and to provide phonetic guidance. Nowadays, dots are used permanently with characters, unlike diacritics, which are used only in limited circumstances. Fig. 1 illustrates this idea with an example Arabic sentence written once using only the Rasms, once with the addition of dots, and once with the addition of diacritics along with dots. As a second example, Fig. 2 shows an image of a parchment bearing an ancient Quranic manuscript; it was written using Rasms and does not contain any dots.
A Rasm sequence without dots can represent several possible words. Fig. 3 presents two example Rasm sequences and the different possible Arabic words they can represent. For the first example (the top row in the figure), the Rasm sequence can represent six possible words. Similarly, the Rasm sequence in the second example can represent five different words. The correct word depends on the context of the text and can be difficult to identify at a glance, even for a native speaker.
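This many-to-one collapse can be made concrete with a toy sketch: map each dotted character to a Rasm class and index a word list by the resulting skeleton. The word list and the partial, position-independent character mapping below are illustrative assumptions, not the dictionary or mapping used in this work.

```python
from collections import defaultdict

# Toy illustration of Rasm ambiguity (not the paper's data): several
# dotted words collapse to the same undotted skeleton. The character
# mapping is a partial, position-independent simplification of the real
# many-to-one character-to-Rasm relationship.
DEDOT = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ",  # dotless tooth shape
    "ي": "ى",                                   # dotless yaa
    "ج": "ح", "خ": "ح",                         # haa shape
    "ذ": "د", "ز": "ر", "ش": "س",
}

def to_rasm(word: str) -> str:
    """Strip dots by mapping each character to its Rasm class."""
    return "".join(DEDOT.get(ch, ch) for ch in word)

# Index a small word list by its Rasm skeleton.
words = ["بنت", "تبن", "نبت", "ثبت", "بيت"]
rasm_index = defaultdict(list)
for w in words:
    rasm_index[to_rasm(w)].append(w)

# Four of the five words share one skeleton and are mutually ambiguous.
candidates = rasm_index[to_rasm("بنت")]
```

Recovering the intended word from `candidates` then requires context, which is exactly what the models in this paper learn.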
Many studies have addressed automating the diacritization process (Abandah et al. [1]; Metwally et al. [2]; Masmoudi et al. [3]). Moreover, Arabic diacritization benefits other tasks in Arabic natural language processing (ANLP), such as machine translation (Diab et al. [4]). Since dots were introduced before diacritics and are more fundamental to the script, we were motivated to investigate the use of deep learning for the automatic dottization of Arabic texts. We believe that automatic dottization will contribute to future ANLP research both directly and indirectly. A direct use case is the automatic transcription of ancient Arabic manuscripts once the Rasms have been recognized. This research may also have implications for Arabic text recognition: some researchers have investigated splitting the task of Arabic text recognition into multiple stages, recognizing Arabic Rasms separately from dots and then combining the Rasms and dots to output the final Arabic text (Ahmad and Fink [5], [6]). Another use of this approach could be in the tokenization of Arabic texts for other NLP tasks. Additionally, a relatively new use case is in social media moderation, where users write Arabic text without dots to evade censorship. As an example, Fig. 4 shows an Arabic tweet written using only Rasms.
To the best of our knowledge, automatically adding dots to Arabic Rasms has not been reported in the literature. In this paper, we present the automatic addition of dots to Arabic text using deep recurrent neural networks. We present two different approaches to dottization, one using word sequences as input and the other using character sequences as input. The presented techniques were evaluated on four different publicly available Arabic text corpora.
It should be noted that although automatic Arabic diacritization is a problem related to automatic Arabic dottization, the two differ in important ways. Dots are more fundamental to Arabic text than diacritics: most Arabic texts are written without diacritics, but not without dots. Consequently, the challenges and opportunities involved in adding dots differ from those involved in adding diacritics. Although a character can, in principle, carry any of the possible diacritics, the diacritic sequences that can occur over a word are very limited. Nouns, for example, have a fixed sequence of diacritics that does not change except for the diacritic on the last character. Furthermore, the sequence of diacritics over a verb follows a certain template, such as fatha-fatha-fatha or damma-kasra-fatha, and the diacritics cannot appear in every possible combination. Additionally, many of the diacritics can appear only at specific positions in a word. For example, the diacritic sukun cannot appear over the first character of a word, and the tanwin diacritics can appear only over the last character of a word. Dots, on the other hand, are fundamental in defining the characters themselves and do not follow fixed templates like diacritics. Additionally, some Arabic Rasms belong to a single character (like ا and ل) and do not have dots. Moreover, some word Rasms correspond to only one specific Arabic word, as no other Arabic word appears with the same Rasm sequence. This insight can help in the dottization task, as we discuss in the rest of the paper. Furthermore, homographs in Arabic can take different diacritics, but dottization does not face such issues. For example, the Arabic word ذهب can mean the noun gold or the verb he went depending on the diacritic over the last character.
On the other hand, many verb forms, such as يكتب (he writes), تكتب (she writes), and نكتب (we write), have the same Rasm sequence but different dots in the prefix, while all the forms share the same diacritic sequence, fatha sukun damma fatha.
The rest of the paper is structured as follows: in Section 2, we present the background on the Arabic script. In Section 3, we present the related work specifically focusing on Arabic diacritization. In Section 4, we present the methodology, including tokenization, model design and postprocessing steps. In Section 5, we present the datasets used for experimentation, the evaluation metrics used to report the results, the experimentation and results and some discussions related to error analysis. Finally, in Section 6, we conclude with our findings and discuss possible extensions to our work.
Background on the Arabic script
Arabic is written from right to left and has 28 characters; it does not have upper and lower cases as in English. Some characters have no dots, while many have dots either above or below them. The 28 characters use 17 unique Rasms, as shown in Fig. 5. It is clear from the figure that several characters can share the same Rasm. For example, the three characters (ب - ت - ث) are mapped to a single Rasm.
Furthermore, an Arabic character can have many shapes depending on its position in a word (isolated, initial, medial, or final).
Related work
To the best of our knowledge, automatically adding dots to Arabic Rasms has not been reported in the literature. The closest research related to our topic is the automatic diacritization of Arabic texts. Consequently, we will briefly present some representative studies in the area of automatic diacritization of Arabic text in addition to some other relevant works on Arabic NLP.
Elshafei et al. [7] formulated the diacritization problem using hidden Markov models (HMMs), wherein the hidden states correspond to the diacritized forms and the observed sequence is the undiacritized text.
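Such an HMM formulation can be sketched generically with Viterbi decoding, where hidden states are diacritized forms and observations are the bare, undiacritized symbols. The two-state toy probability tables below are illustrative assumptions, not the actual model of Elshafei et al.:

```python
# Generic Viterbi decoding for an HMM framing of diacritization:
# hidden states are diacritized forms, observations the bare symbols.
# All probability tables below are toy assumptions for illustration.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for `obs`."""
    # Each column maps a state to (best probability, best path so far).
    col = {s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}
    for o in obs[1:]:
        nxt = {}
        for s in states:
            prob, path = max(
                (col[p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(o, 0.0),
                 col[p][1] + [s])
                for p in states
            )
            nxt[s] = (prob, path)
        col = nxt
    return max(col.values())[1]

# Toy example with two hidden states and two observable symbols.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
best = viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p)
```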
Methodology
In this section, we describe our methodology for automatically adding dots to Arabic text using deep learning. We use recurrent neural networks (RNNs) because they have shown promising results on various NLP tasks, as is evident from the literature. Since the input is a sentence without dots and the output is the same sentence with dots, we adopt the sequence-to-sequence (Seq2Seq) learning approach. Moreover, there is a direct one-to-one mapping between the input and the output, so the input sequence and the output sequence have the same length.
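Under this one-to-one framing, preparing a training pair amounts to encoding two equal-length character sequences with separate input-side and output-side vocabularies, so that the model predicts one output label per input position. The helper names and the example pair below are illustrative assumptions, not the paper's actual preprocessing code:

```python
# Sketch of data preparation under the strict one-to-one framing:
# the undotted input and the dotted output have the same length,
# so the model predicts one output character per input position.
# Helper names and the example pair are illustrative assumptions.

def build_vocab(texts):
    """Map every character seen in `texts` to an integer id (0 = padding)."""
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab):
    return [vocab[ch] for ch in text]

src = "ٮٮٮ"  # undotted Rasm skeleton (dotless tooth shapes)
tgt = "بنت"  # the same sequence with dots restored ("girl")

src_vocab = build_vocab([src])   # input-side vocabulary
tgt_vocab = build_vocab([tgt])   # output-side vocabulary

src_ids = encode(src, src_vocab)
tgt_ids = encode(tgt, tgt_vocab)
assert len(src_ids) == len(tgt_ids)  # one label per input position
```

Keeping the two vocabularies separate reflects the asymmetry of the task: the input alphabet (Rasms) is smaller than the output alphabet (dotted characters).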
Experiments and results
First, we present the datasets used for the experiments in addition to the preprocessing and tokenization carried out before feeding the data to our system. Next, we present the metrics used for system evaluation. This is followed by details on system training. Finally, we present the results and the discussion.
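Because the predicted sentence is aligned position-by-position with the reference, a character error rate reduces to a simple positional mismatch rate. The sketch below assumes this metric shape for illustration; it is not necessarily the exact metric definition used in the experiments:

```python
# With predictions aligned one-to-one to the reference, the character
# error rate reduces to the fraction of mismatched positions. This is
# an illustrative sketch of that metric shape, not necessarily the
# exact definition used in the paper's experiments.

def char_error_rate(pred: str, ref: str) -> float:
    """Fraction of positions where `pred` disagrees with `ref`."""
    if len(pred) != len(ref):
        raise ValueError("one-to-one alignment assumed: lengths must match")
    if not ref:
        return 0.0
    return sum(p != r for p, r in zip(pred, ref)) / len(ref)

# Example: one wrong character out of four positions.
cer = char_error_rate("abcd", "abcx")  # 0.25
```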
Conclusions
In this paper, we presented the automatic dottization of Arabic text. To the best of our knowledge, this is the first work addressing this topic. We used recurrent neural networks for the task and investigated two different approaches: one using words as tokens and the other using characters as tokens. The benefits and limitations of both approaches were discussed; overall, the character-level system outperformed the word-level system. A postprocessing step led to further small improvements.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
The authors would like to thank Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) for supporting this work under SDAIA-KFUPM Joint Research Center for Artificial Intelligence grant no. JRC-AI-RFP-06.
References (22)
- Al-shaibani et al., "MetRec: a dataset for meter classification of Arabic poetry," Data Brief (2020)
- Stolcke, "SRILM - an extensible language modeling toolkit," Seventh International Conference on Spoken Language Processing (2002)
- "A rule and template based stemming algorithm for Arabic language," Int. J. Math. Mod. Meth. Appl. Sci. (2011)
- Abandah et al., "Automatic diacritization of Arabic text using recurrent neural networks," Int. J. Doc. Anal. Recognit. (IJDAR) (2015)
- Metwally et al., "A multi-layered approach for Arabic text diacritization," 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) (2016)
- Masmoudi et al., "Automatic diacritization of Tunisian dialect text using recurrent neural network," Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019) (2019)
- Diab et al., "Arabic diacritization in the context of statistical machine translation," Proceedings of Machine Translation Summit XI: Papers, Copenhagen, Denmark (2007)
- Ahmad and Fink, "Multi-stage HMM based Arabic text recognition with rescoring," 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015)
- Ahmad and Fink, "Handwritten Arabic text recognition using multi-stage sub-core-shape HMMs," Int. J. Doc. Anal. Recognit. (IJDAR) (2019)
- Elshafei et al., "Statistical methods for automatic diacritization of Arabic text," The Saudi 18th National Computer Conference, Riyadh (2006)
- Belinkov and Glass, "Arabic diacritization with recurrent neural networks," Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)
☆ Editor: Jiwen Lu