Translation from Tunisian Dialect to Modern Standard Arabic: Exploring Finite-State Transducers and Sequence-to-Sequence Transformer Approaches

Published: 23 October 2024

Abstract

Translating from a speaker's mother tongue, such as the Tunisian dialect, into Modern Standard Arabic is an important task in natural language processing because of its wide range of applications. Research interest in the Tunisian dialect has grown in recent years, driven primarily by the massive volume of content that Tunisians have generated spontaneously on social media since the revolution. This article presents two translators that convert the Tunisian dialect into Modern Standard Arabic. The first takes a rule-based approach, using a collection of finite-state transducers and a bilingual dictionary derived from the study corpus. The second relies on deep learning, specifically a sequence-to-sequence transformer model and a parallel corpus. To evaluate and compare the two translators, we tested them on a parallel corpus of 8,599 words. The translator based on finite-state transducers achieved a BLEU score of 56.65, while the transformer-based translator achieved a higher score of 66.07.
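
Since both translators are compared by BLEU score, a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed may help in reading the two figures. The abstract does not say which BLEU implementation or tokenization was used, so the function names `ngrams` and `corpus_bleu`, the single-reference setup, the 4-gram limit, and the toy tokens below are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing.

    Illustrative sketch only: not the evaluation script used in the article.
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Toy tokens (placeholders, not data from the article).
reference = [["a", "b", "c", "d", "e"]]
print(corpus_bleu([["a", "b", "c", "d", "e"]], reference))
print(corpus_bleu([["a", "b", "c", "d"]], reference))
```

An exact match scores 100, and a hypothesis that drops the final token keeps perfect n-gram precision but is reduced by the brevity penalty. Scores from different BLEU implementations or tokenizations are not directly comparable, so this sketch would not be expected to reproduce the reported values.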

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 10
    October 2024
    189 pages
    EISSN:2375-4702
    DOI:10.1145/3613658
    • Editor: Imed Zitouni
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 October 2024
    Online AM: 24 July 2024
    Accepted: 22 July 2024
    Revised: 06 July 2024
    Received: 01 February 2024
    Published in TALLIP Volume 23, Issue 10

    Author Tags

    1. Tunisian dialect
    2. finite-state transducer
    3. sequence-to-sequence transformer
    4. machine translation
