Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models

Published: 20 July 2023

Abstract

Tokenization is the process of segmenting a piece of text into smaller units called tokens. Because Arabic is an agglutinative language, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task combined with a rewriting process that rewrites the orthography of the stem. We carry out this task on Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study of word tokenization for TA. We therefore start by collecting and preparing various TA corpora from different sources. We then compare three character-based tokenizers built on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best model, based on CRF, achieved an F-measure of 88.9%.
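To make the character-based framing concrete, the following is a minimal sketch of how word tokenization can be cast as per-character labelling, which is the kind of input a CRF, SVM, or DNN tagger would consume. It is an illustration under stated assumptions, not the article's implementation: the B/I label scheme, the `+` separator, the function names, the boundary-level F-measure, and the transliterated example word are all hypothetical, and the stem-rewriting step is not covered.

```python
from typing import List, Tuple


def encode(segmented: str, sep: str = "+") -> Tuple[str, List[str]]:
    """Turn a segmented word such as 'w+ktb+hA' into its characters and B/I labels.

    'B' marks the first character of a token, 'I' any other character.
    (Assumed label scheme; the article may use a different one.)
    """
    chars: List[str] = []
    labels: List[str] = []
    for token in segmented.split(sep):
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return "".join(chars), labels


def decode(word: str, labels: List[str], sep: str = "+") -> str:
    """Rebuild the segmented form from a word and its predicted B/I labels."""
    out: List[str] = []
    for i, (ch, label) in enumerate(zip(word, labels)):
        if label == "B" and i > 0:
            out.append(sep)  # a 'B' after the first character opens a new token
        out.append(ch)
    return "".join(out)


def boundary_f_measure(gold: List[str], pred: List[str]) -> float:
    """F-measure over word-internal token boundaries (assumed evaluation unit)."""
    gold_b = {i for i, label in enumerate(gold) if label == "B" and i > 0}
    pred_b = {i for i, label in enumerate(pred) if label == "B" and i > 0}
    true_pos = len(gold_b & pred_b)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_b)
    recall = true_pos / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: conjunction + stem + pronoun clitic.
word, gold = encode("w+ktb+hA")
pred = ["B", "I", "I", "I", "B", "I"]  # a tagger that missed the first boundary
print(word, gold)
print(decode(word, pred))
print(round(boundary_f_measure(gold, pred), 3))
```

Under this framing, the three models compared in the article would differ only in how they predict the label sequence from character features; the encoding and decoding steps stay the same.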

Published in ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7 (July 2023), 422 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3610376

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 20 July 2023
• Online AM: 24 May 2023
• Accepted: 19 May 2023
• Revised: 30 August 2022
• Received: 15 May 2021