Abstract
Unlike many other languages, Arabic has a written form that is essentially consonantal and often omits short vowels. One of the major functions of short vowels is to determine and clarify the meaning of words and sentences. However, Modern Standard Arabic (MSA) texts are generally written without them, which gives rise to considerable morphological, semantic, and syntactic ambiguity. This ambiguity problem affects not only MSA but also Arabic dialects in general and the Tunisian Dialect (TD) in particular. Compared to MSA, TD suffers from a lack of basic tools and linguistic resources, such as sufficiently large corpora, multilingual dictionaries, and morphological and syntactic analyzers. The lack of these resources makes processing this language a great challenge (Masmoudi et al., 2020). Despite the numerous efforts currently underway, some gaps persist in this field. We address this gap by investigating the automatic diacritization of TD texts. We treat diacritization as a simplified phrase-based Statistical Machine Translation (SMT) task in which the source language is the undiacritized text and the target language is the diacritized text. We first describe in detail how the TD corpus was created. This corpus was then validated and used to build a diacritic restoration system for TD, called TDTACHKIL, which achieves a Word Error Rate (WER) of 16.7% and a Diacritic Error Rate (DER) of 8.89%.
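To make the source/target framing and the two error rates concrete, here is a minimal Python sketch. It is not the TDTACHKIL implementation. It assumes text in Buckwalter transliteration, assumes the diacritic set listed in `DIACRITICS`, and assumes one common way of counting WER and DER (a word is wrong if any of its letters carries the wrong diacritics). The function names and the example words are made up for illustration and do not come from the TD corpus.

```python
# Buckwalter symbols treated as diacritics in this sketch (an assumption):
# short vowels a/u/i, sukun o, shadda ~, tanween F/N/K.
DIACRITICS = set("aiuo~FNK")


def strip_diacritics(text):
    """Build the undiacritized (source) side from a diacritized (target) line."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_letters(word):
    """Split a word into (letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (WER, DER) for two diacritized lines over the same letters."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("lines must be non-empty and have equal word counts")

    word_errors = 0
    letter_errors = 0
    total_letters = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_letters(ref_word)
        hyp_pairs = split_letters(hyp_word)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("undiacritized letters differ: " + ref_word)
        # Count letters whose attached diacritics differ from the reference.
        wrong = sum(1 for (_, r), (_, h) in zip(ref_pairs, hyp_pairs) if r != h)
        letter_errors += wrong
        total_letters += len(ref_pairs)
        if wrong:
            word_errors += 1

    return word_errors / len(ref_words), letter_errors / total_letters


if __name__ == "__main__":
    target = "kataba Aldarsa"          # diacritized (target side)
    source = strip_diacritics(target)  # undiacritized (source side)
    print(source)
    print(error_rates(target, "kutiba Aldarsa"))
```

In an SMT setup of the kind described above, each training pair would consist of a line produced by something like `strip_diacritics` on the source side and the original diacritized line on the target side. The published WER and DER figures may rely on a different counting convention, for example one that ignores letters carrying no diacritic.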
Notes
Transcription is coded following Buckwalter. For more details about it, see Habash and Rambow (2007).
References
Abandah, G., Graves, A., Al-Shagoor, B., Arabiyat, A., Jamour, F., & Al-Taee, M. (2015a). Automatic diacritization of Arabic text using recurrent neural networks. International Journal on Document Analysis and Recognition (IJDAR).
Abandah, G., Graves, A., Arabiyat, B., Jamour, F., & Al-Taee, M. (2015b). Automatic diacritization of Arabic text using recurrent neural networks. International Journal on Document Analysis and Recognition (IJDAR), 18(2), 183–197.
Afli, H., Barrault, L., & Schwenk, H. (2016). OCR error correction using statistical machine translation. International Journal of Linguistics and Computational Applications, 7(1), 175–191.
Ahmed, A., & Elaraby, M. (2000). A large-scale computational processor of the Arabic morphology, and applications. PhD thesis, Faculty of Engineering, Cairo University Giza, Egypt.
Al-Badrashiny, M., Hawwari, A., & Diab, M. (2017). A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic. In Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP).
Al-Taani, A., & Abu Al-Rub, S. (2009). A Rule-Based Approach for Tagging Non-Vocalized Arabic Words. The International Arab Journal of Information Technology.
Alghamdi, M., & Muzaffar, Z. (2007). KACST Arabic diacritizer. In Proceedings of the First International Symposium on Computers and Arabic Language, Riyadh, Saudi Arabia.
Alnefaiea, R., & Azmi, M. (2017). Automatic minimal diacritization of Arabic texts. ACLing 2017.
Alotaibi, Y. A., Meftah, A. H., & Selouani, S. A. (2013). Diacritization. In Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE): Automatic Segmentation and Labeling for Levantine Arabic Speech.
Alqudah, S., Abandah, G., & Arabiyat, A. (2017). Investigating Hybrid Approaches for Arabic Text Diacritization with Recurrent Neural Networks. 2017 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies.
Ameur, M., Moulahoum, Y., & Guessoum, A. (2015). Restoration of Arabic Diacritics Using a Multilevel Statistical Model. In IFIP International Federation for Information Processing.
Ayman, A. Z., Elmahdy, M., Husni, H., & Al Jaam, J. (2016). Automatic diacritics restoration for Arabic text. International Journal of Computing & Information Science, December 2016. https://doi.org/10.21700/ijcis.2016.119.
Azmi, A., & Almajed, R. (2015). A survey of automatic Arabic diacritization techniques. Natural Language Engineering, 21, 477–495.
Baccouche, T. (2003). L’arabe, d’une koinè dialectale à une langue de culture. Mémoires de la Société linguistique de Paris, Tome XI (Les langues de communication...), 87–93.
Baccouche, T. (1994). L’emprunt en arabe moderne, Beit Elhikma et IBLV.
Belinkov, Y., & Glass, J. (2015). Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Bouamor, H., Zaghouani, W., Diab, M., Obeid, O., Kemal, O., Ghoneim, M., & Hawwari, A. (2015). A pilot study on Arabic multi-genre corpus diacritization annotation. In Proceedings of the Second Workshop on Arabic Natural Language Processing.
Boujelbane, R., Mallek, M., Ellouze, M., & Belguith, L. (2014). Fine-Grained (POS) Tagging of Spoken Tunisian Dialect Corpora. In International Conference on Applications of Natural Language to Information Systems, NLDB’2014.
Brown, P., Pietra, S., Pietra, V., & Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Darwish, K., Abdelali, A., Mubarak, H., Samih, Y., & Attia, M. (2018). Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Darwish, K., Mubarak, H., & Abdelali, A. (2017). Arabic diacritization: Stats, rules, and hacks. Proceedings of the Third Arabic Natural Language Processing Workshop, 9–17.
Diab, M., Ghoneim, M., & Habash, N. (2007). Arabic Diacritization in the Context of Statistical Machine Translation. In Proceedings of MTSummit, Copenhagen, Denmark.
El Klibi, S., El Hamzaoui, S., Ben Abda, H., Kaddes, C., El Horcheni, F., & Maalla, A. (2014). La constitution en dialecte tunisien. Tunisie: Association tunisienne de droit constitutionnel.
Elshafei, M., Al-muhtaseb, H., & Alghamdi, M. (2006). Statistical methods for automatic diacritization of Arabic text. In Proceedings of Saudi 18th National Computer Conference (NCC18).
Fashwan, A., & Alansary, S. (2017). SHAKKIL: an automatic diacritization system for modern standard Arabic texts. Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP).
Gal, Y. (2002). An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL’2002 Workshop on Computational Approaches to Semitic Languages, SEMITIC’02.
Gibson, M. L. (1998). Dialect Contact in Tunisian Arabic: Sociolinguistic and Structural Aspects. University of Reading.
Graja, M., Jaoua, M., & Belguith, L. (2015). Statistical Framework with Knowledge Base Integration for Robust Speech Understanding of the Tunisian Dialect. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12), 2311–2321. https://doi.org/10.1109/TASLP.2015.2464687.
Habash, N., Shahrour, A., & Al-Khalil, M. (2016). 2016.
Habash, N., Diab, M., & Rambow, O. (2012). Conventional Orthography for Dialectal Arabic. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’2012).
Habash, N., & Rambow, O. (2007). Arabic diacritization through full morphological tagging. The Conference of the North American Chapter of the Association for Computational Linguistics.
Hamed, O., & Zesch, T. (2017). A Survey and Comparative Study of Arabic Diacritization Tools. Journal of Language Technology and Computational Linguistics, 32(1).
Harrat, S., Abbas, M., Meftouh, K., & Smaili, K. (2013). Diacritics restoration for Arabic dialect texts. 14th Annual Conference of the International Speech Communication Association.
Hermena, E., Drieghe, D., Hellmuth, S., & Simon P. (2015). Processing of Arabic Diacritical Marks: Phonological Syntactic Disambiguation of Homographic Verbs and Visual Crowding Effects. Journal of Experimental Psychology: Human Perception and Performance, 41, 494–507.
Hifny, Y. (2012). Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering (ESOLEC 12).
Holes, C. (2004). Modern Arabic: Structures, Functions, and Varieties, Georgetown. Ed. Washington.
Jarrar, M., Habash, N., Akra, D., Zalmout, N., & Bank, W. (2014). Building a Corpus for Palestinian Arabic: a Preliminary Study.
Khalifa, S., Habash, N., Eryani, F., Obeid, O., Abdulrahim, D., & Al Kaabi, M. (2018). A Morphologically Annotated Corpus of Emirati Arabic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC’2018.
Kirchhoff, K., & Vergyri, D. (2005). Cross-Dialectal Data Sharing for Acoustic Modeling in Arabic Speech Recognition. Speech Communication, 46, 37–51.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., & Bertoldi, N. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. ACL’2007, demonstration session.
Kubra, A., & Eryigit, G. (2014). Vowel and Diacritic Restoration for Social Media Texts. 5th Workshop on Language Analysis for Social Media (LASM) at EACL’2014.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proc. ICML, 282–289.
Lawson, S., & Itesh, S. (1997). Accommodation communicative en Tunisie: une etude empirique. Plurilinguisme et identités au Maghreb, Publications de l’Universite de Rouen, 01–114.
Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pp. 102–109.
Maamouri, M., Bies, A., & Kulick, S. (2006). Diacritization: A Challenge to Arabic Treebank Annotation and Parsing. Proceedings of the British Computer Society Arabic NLP/MT Conference.
Maamouri, M., Bies, A., & Kulick, S. (2008). Enhancing the Arabic treebank: a collaborative effort toward new annotation guidelines. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008).
Maamouri, M., Bies, A., & Kulick, S. (2009). Creating a methodology for large-scale correction of treebank annotation: The case of the Arabic treebank. In Proceedings of MEDAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt.
Masmoudi, A., Bougares, F., Khmekhem, M., Estève, Y., & Belguith, L. (2018). Automatic speech recognition system for Tunisian dialect. Language Resources and Evaluation, 52(1), 249–267.
Masmoudi, A., Ellouze, M., & Belguith, L. (2019). Automatic diacritization of Tunisian dialect text using Recurrent Neural Network. RANLP, 2019, 730–739.
Masmoudi, A., Ellouze, M., Khrouf, M., & Belguith, L. (2020). Transliteration of Arabizi into Arabic Script for Tunisian Dialect. ACM Transactions on Asian and Low-Resource Language Information Processing, 19(2), 32:1-32:21.
Masmoudi, A., Khmekhem, M., Estève, Y., Bougares, F., & Belguith, L. (2014). Phonetic tool for the Tunisian Arabic. In the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages.
Masmoudi, A., Khmekhem, M., Estève, Y., Bougares, F., Belguith, L., & Habash, N. (2014). A corpus and a phonetic dictionary for Tunisian Arabic speech recognition. In 19th edition of the Language Resources and Evaluation Conference.
Masmoudi, A., Habash, N., Khmekhem, M., Estève, Y., & Belguith, L. (2015). Arabic Transliteration of Romanized Tunisian Dialect Text: A Preliminary Investigation. Computational Linguistics and Intelligent Text Processing, 16th International Conference, CICLing 2015.
Masmoudi, A., Mdhaffar, S., Sellami, R., & Belguith, L. (2019). Automatic Diacritics Restoration for Tunisian Dialect. ACM Transactions on Asian and Low-Resource Language Information Processing, 18(3).
Mejri, S., Said, M., & Sfar, I. (2009). Plurilinguisme et diglossie en Tunisie. Synerg. Tunisie, 1, 53–74.
Nelken, R., & Shieber, S. M. (2005). Arabic diacritization using weighted finite-state transducers. In ACL Workshop on Computational Approaches to Semitic Languages, pp. 79–86.
Och, F., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19–52.
Ouerhani, B. (2009). Interférence entre le dialectal et le littéral en Tunisie: Le cas de la morphologie verbale. Synerg. Tunisie, 1, 75–84.
Rashwan, M., Al Sallab, A., Raafat, H., & Rafea, A. (2015). Deep Learning Framework with Confused Sub Set Resolution Architecture for Automatic Arabic Diacritization. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Saadane, H., & Habash, N. (2015). A Conventional Orthography for Algerian Arabic. In Proceedings of the Second Workshop on Arabic Natural Language Processing.
Said, A., El-Sharqwi, M., Chalabi, A., & Kamal, E. (2013). A hybrid approach for Arabic diacritization. In E. Métais, F. Meziane, M. Saraee, V. Sugumaran, & S. Vadera (Eds.), Natural Language Processing and Information Systems, Lecture Notes in Computer Science, vol. 7934, pp. 53–64. Springer.
Schlippe, T., ThuyLinh, N., & Stephan, V. (2008). Diacritization as a Machine Translation Problem and as a Sequence Labeling Problem. Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA), Hawai’i, USA.
Schlippe, T. (2008). Statistical methods for automatic Diacritization of Arabic Texts. Carnegie Mellon University, Pittsburgh, USA, May 2008.
Sfar, I. (2005). Morphologie des noms de professions : incorporation et paraphrase, La terminologie, entre traduction et bilinguisme, pages 15–16, 2005.
Shaalan, K., Abo Bakr, M., & Ziedan, I. (2009). A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages.
Shaalan, K., Abo Bakr, H., & Ziedan, I. (2008). A statistical method for adding case ending diacritics for Arabic text. In Proceedings of Language Engineering Conference.
Stolcke, A. (2002). SRILM: An Extensible Language Modeling Toolkit. Proceedings of ICSLP.
Talmoudi, F. (1980). A morphosyntactic study of Romance verbs in the Arabic dialects of Tunis, Sousa, and Sfax. Göteborg: Acta Universitatis Gothoburgensis.
Tilmatine, M. (1999). Substrat Et Convergences: Le Berbére Et L’arabe Nord-Africain, in: HAAK, M., JONG, R. DE, VERSTEEGH, K. (Eds.), Estudios de Dialectologia Norteafricana Y Andalusi.
Vergyri, D., & Kirchhoff, K. (2004). Automatic diacritization of Arabic for acoustic modeling in speech recognition. Workshop on Computational Approaches to Arabic Script-based Languages, pp. 66–73.
Wang, D., & King, S. (2011). Letter-to-sound Pronunciation Prediction Using Conditional Random Fields. IEEE Signal Processing Letters.
Zaghouani, W., Habash, N., Bouamor, H., Rozovskaya, A., Mohit, B., Heider, A., & Oflazer, K. (2015). Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of the Association for Computational Linguistics Fourth Linguistic Annotation Workshop.
Zaghouani, W., Bouamor, H., Hawwari, A., Diab, M., Obeid, O., Ghoneim, M., Alqahtani, S., & Oflazer, K. (2016). Guidelines and framework for a large-scale Arabic diacritized corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation: LREC’2016.
Zaghouani, W., Habash, N., Obeid, O., Mohit, B., Bouamor, H., & Oflazer, K. (2016). Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In International Conference on Language Resources and Evaluation: LREC’2016.
Zitouni, I., Sorensen, J., & Sarikaya, R. (2006). Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics.
Zitouni, I., & Sarikaya, R. (2009). Arabic Diacritic Restoration Approach Based on Maximum Entropy Models. Computer Speech and Language.
Zribi, I., Khmekhem, M., Belguith, L., & Blache, P. (2017). Morphological disambiguation of Tunisian dialect. Journal of King Saud University, Computer and Information Sciences, 29, 147–155.
Zribi, I., Boujelbane, R., Masmoudi, A., Ellouze, M., Belguith, L., & Habash, N. (2014). A Conventional Orthography for Tunisian Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation: LREC’14.
Zribi, I., Ellouze, M., Belguith, L. H., & Blache, P. (2015). Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation. Research in Computing Science, 90.
Zribi, I., Graja, M., Khemakhem, M. E., Jaoua, M., & Belguith, L. (2013). Orthographic Transcription for Spoken Tunisian Arabic. In A. Gelbukh (Ed.), CICLing 2013.
About this article
Cite this article
Masmoudi, A., Aloulou, C., Abdellahi, A.G.S. et al. Automatic diacritization of Tunisian dialect text using SMT model. Int J Speech Technol 25, 89–104 (2022). https://doi.org/10.1007/s10772-021-09864-6
DOI: https://doi.org/10.1007/s10772-021-09864-6