research-article

Hybrid Pipeline for Building Arabic Tunisian Dialect-standard Arabic Neural Machine Translation Model from Scratch

Published: 14 April 2023

Abstract

Deep Learning is among the most promising approaches to machine translation, and it has achieved impressive results when large amounts of parallel data are available for high-resource languages. For low-resource languages such as the Arabic dialects, however, Deep Learning models fall short because parallel corpora are scarce. In this article, we present a method for creating a parallel corpus and using it to build an effective NMT model that translates Tunisian Dialect texts found on social networks into MSA. To this end, we propose a set of data augmentation methods that enlarge the state-of-the-art parallel corpus. Evaluating the impact of this step, we observed that it effectively boosted both the size and the quality of the corpus. Using the resulting corpus, we then compare the effectiveness of CNN, RNN, and Transformer models for translating Tunisian Dialect into MSA. Experiments show that the Transformer model achieves the best translation, with a BLEU score of 60, versus 33.36 for the RNN model and 53.98 for the CNN model.
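The model comparison above is scored with BLEU. As a rough, self-contained illustration of what that metric computes (modified n-gram precision combined with a brevity penalty), here is a minimal sketch of corpus-level BLEU for a single reference per sentence; the helper names (`corpus_bleu`, `ngrams`) are ours, and this is not the evaluation toolkit the authors used:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100]: geometric mean of modified
    1..max_n-gram precisions times a brevity penalty, with one
    reference per hypothesis and no smoothing."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
            totals[n - 1] += max(0, len(hyp) - n + 1)
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, ref_grams[g]) for g, c in hyp_grams.items())
    if min(clipped) == 0:   # some precision is zero; unsmoothed BLEU is 0
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

An identical hypothesis and reference scores 100; real evaluations normally rely on a standard toolkit (e.g., sacreBLEU) so that tokenization and smoothing are handled consistently across systems.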

References

[1]
R. Al-Ibrahim and R. M. Duwairi. 2020. Neural machine translation from Jordanian dialect to modern standard Arabic. In Proceedings of the 11th International Conference on Information and Communication Systems (ICICS). 173–178.
[2]
Alina Karakanta, Jon Dehdari, and Josef van Genabith. 2018. Neural machine translation for low-resource languages without parallel corpora. Mach. Translat. 32 (2018), 167–189.
[3]
Ebtesam H. Almansor and Ahmed Al-Ani. 2018. A hybrid neural machine translation technique for translating low resource languages. In Machine Learning and Data Mining in Pattern Recognition. Springer International Publishing, Cham, 347–356.
[4]
Ebtesam H. Almansor and Ahmed Al-Ani. 2017. Translating dialectal Arabic as low resource language using word embedding. In Proceedings of the International Conference Recent Advances in Natural Language Processing. INCOMA Ltd., 52–57.
[5]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arxiv:1409.0473 [cs.CL].
[6]
Laith H. Baniata, Seyoung Park, and Seong-Bae Park. 2018. A neural machine translation model for Arabic dialects that utilizes multitask learning (MTL). Computat. Intell. Neurosci. 10 (Dec. 2018).
[7]
Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), 1240–1245. Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/523_Paper.pdf.
[8]
Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, and Kemal Oflazer. 2018. The MADAR Arabic dialect corpus and lexicon. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resources Association (ELRA). Retrieved from https://www.aclweb.org/anthology/L18-1535.
[9]
Rahma Boujelbane, Mariem Ellouze Khemekhem, and Lamia Hadrich Belguith. 2013. Mapping rules for building a Tunisian dialect lexicon and generating corpora. In Proceedings of the 6th International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, 419–428. Retrieved from https://www.aclweb.org/anthology/I13-1048.
[10]
Kehai Chen, Rui Wang, Masao Utiyama, Lemao Liu, Akihiro Tamura, Eiichiro Sumita, and Tiejun Zhao. 2017. Neural machine translation with source dependency representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[11]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv:1810.04805 [cs.CL].
[12]
Fatma El-zahraa El-taher, Alaa Aldin Hammouda, and Salah Abdel-Mageid. 2016. Automation of understanding textual contents in social networks. In Proceedings of the International Conference on Selected Topics in Mobile Wireless Networking (MoWNeT). 1–7.
[13]
Fei Gao, Jinhua Zhu, Lijun Wu, Yingce Xia, Tao Qin, Xueqi Cheng, Wengang Zhou, and Tie-Yan Liu. 2019. Soft contextual data augmentation for neural machine translation. In Proceedings of the Association for Computational Linguistics.
[14]
M. Graja, M. Jaoua, and L. Hadrich Belguith. 2015. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio, Speech Lang. Process. 23, 12 (2015), 2311–2321.
[15]
Ahmed Hamdi, Rahma Boujelbane, Nizar Habash, and Alexis Nasr. 2013. The effects of factorizing root and pattern mapping in bidirectional Tunisian–standard Arabic machine translation. In Proceedings of the MT Summit. Retrieved from https://hal.archives-ouvertes.fr/hal-00908761.
[16]
Serena Jeblee, Weston Feely, Houda Bouamor, Alon Lavie, Nizar Habash, and Kemal Oflazer. 2014. Domain and dialect adaptation for machine translation into Egyptian Arabic. In Proceedings of the EMNLP Workshop on Arabic Natural Language Processing (ANLP). Association for Computational Linguistics, 196–206.
[17]
Jinyi Zhang and Tadahiro Matsumoto. 2019. Corpus augmentation by sentence segmentation for low-resource neural machine translation. CoRR abs/1905.08945 (2019).
[18]
Karima Meftouh, Salima Harrat, S. Jamoussi, M. Abbas, and Kamel Smaïli. 2015. Machine translation experiments on PADIC: A parallel Arabic dialect corpus. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 26–34.
[19]
Saméh Kchaou, Rahma Boujelbane, and Lamia Hadrich-Belguith. 2020. Parallel resources for Tunisian Arabic dialect translation. In Proceedings of the 5th Arabic Natural Language Processing Workshop. Association for Computational Linguistics, 200–206. Retrieved from https://www.aclweb.org/anthology/2020.wanlp-1.18.
[20]
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics.
[21]
Julia Kreutzer, Jasmijn Bastings, and Stefan Riezler. 2020. Joey NMT: A Minimalist NMT Toolkit for Novices. arxiv:1907.12484 [cs.CL].
[22]
Surafel M. Lakew, Mauro Cettolo, and Marcello Federico. 2018. A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation. arxiv:1806.06957 [cs.CL].
[23]
Y. Li, X. Li, Y. Yang, and R. Dong. 2020. A diverse data augmentation strategy for low-resource neural machine translation. Information 11, 255 (2020), 2078–2489.
[24]
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 567–573.
[25]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arxiv:1912.01703 [cs.LG].
[26]
Aquia Richburg, Ramy Eskander, Smaranda Muresan, and Marine Carpuat. 2020. An evaluation of subword segmentation strategies for neural machine translation of morphologically rich languages. In Proceedings of the 4th Widening Natural Language Processing Workshop. Association for Computational Linguistics, 151–155.
[27]
Alexander Rush. 2018. The annotated transformer. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS). Association for Computational Linguistics, 52–60.
[28]
Wael Salloum and Nizar Habash. 2012. Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of COLING 2012: Demonstration Papers. The COLING 2012 Organizing Committee, 385–392. Retrieved from https://www.aclweb.org/anthology/C12-3048.
[29]
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 464–468.
[30]
Bashar Talafha, Mohammad Ali, Muhy Eddin Za’ter, Haitham Seelawi, Ibraheem Tuffaha, Mostafa Samir, Wael Farhan, and Hussein T. Al-Natsheh. 2020. Multi-dialect Arabic BERT for Country-level Dialect Identification. arxiv:2007.05612 [cs.CL].
[31]
Allahsera Auguste Tapo, Bakary Coulibaly, Sébastien Diarra, Christopher Homan, Julia Kreutzer, Sarah Luger, Arthur Nagashima, Marcos Zampieri, and Michael Leventhal. 2020. Neural machine translation for extremely low-resource African languages: A case study on Bambara. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages. Association for Computational Linguistics.
[32]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR abs/1706.03762 (2017).
[33]
Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The Surprising Cross-lingual Effectiveness of BERT. arxiv:1904.09077 [cs.CL].
[34]
Inès Zribi, M. Ellouze, L. Belguith, and P. Blache. 2017. Morphological disambiguation of Tunisian dialect. J. King Saud Univ. Comput. Inf. Sci. 29 (2017), 147–155.



    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
    March 2023
    570 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3579816

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 April 2023
    Online AM: 02 November 2022
    Accepted: 05 October 2022
    Received: 26 January 2022
    Published in TALLIP Volume 22, Issue 3


    Author Tags

    1. Neural Machine Translation
    2. data augmentation
    3. Arabic Tunisian Dialect
    4. Modern Standard Arabic

    Qualifiers

    • Research-article


    Cited By

    • (2024) A Survey on Machine Translation of Low-Resource Arabic Dialects. In 2024 15th International Conference on Information and Communication Systems (ICICS), 1–6. DOI: 10.1109/ICICS63486.2024.10638285. Online publication date: 13 August 2024.
    • (2024) Latest Research in Data Augmentation for Low Resource Language Text Translation: A Review. In 2024 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 185–190. DOI: 10.1109/IC3INA64086.2024.10732042. Online publication date: 9 October 2024.
    • (2024) BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization. In 2024 IEEE International Conference on Big Data (BigData), 1635–1644. DOI: 10.1109/BigData62323.2024.10826131. Online publication date: 15 December 2024.
    • (2024) Crossing Linguistic Barriers: A Hybrid Attention Framework for Chinese-Arabic Machine Translation. In 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), 1–6. DOI: 10.1109/ACDSA59508.2024.10467398. Online publication date: 1 February 2024.
    • (2024) Arabic Text Formality Modification: A Review and Future Research Directions. IEEE Access. DOI: 10.1109/ACCESS.2024.3511661. Online publication date: 2024.
    • (2023) Challenges and Progress in Constructing Arabic Dialect Corpora and Linguistic tools: A Focus on Moroccan and Tunisian Dialects. In 2023 7th IEEE Congress on Information Science and Technology (CiSt), 293–298. DOI: 10.1109/CiSt56084.2023.10410009. Online publication date: 16 December 2023.
