research-article

Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models

Authors:
Asma Mekki, ANLP Research Group, MIRACL Lab., University of Sfax, Tunisia (ORCID: 0000-0003-3140-3171)
Inès Zribi, ANLP Research Group, MIRACL Lab., ISIMa, University of Monastir, Tunisia (ORCID: 0000-0002-2065-7873)
Mariem Ellouze, ANLP Research Group, MIRACL Lab., University of Sfax, Tunisia (ORCID: 0000-0003-1864-2602)
Lamia Hadrich Belguith, ANLP Research Group, MIRACL Lab., University of Sfax, Tunisia (ORCID: 0000-0002-4868-657X)

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7, Article No. 194, pp. 1–19. https://doi.org/10.1145/3599234

Published: 20 July 2023

Abstract

Tokenization is the task of segmenting a piece of text into smaller units called tokens. Because Arabic is an agglutinative language, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task together with a rewriting process that rewrites the orthography of the stem. We apply this task to Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study to address word tokenization for TA. We therefore start by collecting and preparing various TA corpora from different sources. We then compare three character-based tokenizers built on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best model, based on CRF, achieved an F-measure of 88.9%.
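
To make the idea of a character-based tokenizer concrete, the following minimal Python sketch shows one common way to frame word tokenization as per-character labeling. It is an illustration under stated assumptions, not the scheme used in the article: the two-label set (B for a character that begins a token, I for a character inside one), the "+" separator, the function names, and the transliterated example word are all hypothetical, and the stem rewriting step is not modeled. A CRF, SVM, or DNN would be trained to predict such labels; here they are derived from a gold segmentation only.

```python
from typing import List


def labels_from_segmentation(segmented: str, sep: str = "+") -> List[str]:
    """Convert a gold segmentation such as 'w+ktbt+hA' into per-character labels.

    Assumed label scheme (hypothetical): 'B' marks the first character of a
    token, 'I' marks every following character of the same token.
    """
    labels: List[str] = []
    for token in segmented.split(sep):
        if not token:
            continue  # skip empty pieces caused by doubled separators
        labels.append("B")
        labels.extend(["I"] * (len(token) - 1))
    return labels


def tokens_from_labels(word: str, labels: List[str]) -> List[str]:
    """Rebuild tokens from an unsegmented word and its per-character labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    tokens: List[str] = []
    for char, label in zip(word, labels):
        if label == "B" or not tokens:
            tokens.append(char)   # start a new token
        else:
            tokens[-1] += char    # extend the current token
    return tokens


if __name__ == "__main__":
    gold = "w+ktbt+hA"  # illustrative transliterated word, not taken from the article
    labels = labels_from_segmentation(gold)
    print(labels)
    print(tokens_from_labels(gold.replace("+", ""), labels))
```

In this framing, the classifier sees one character at a time (usually with features from neighboring characters) and the token boundaries fall out of the predicted label sequence.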

Index Terms

Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models
1. General and reference
  1. Cross-computing tools and techniques
    1. Experimentation
  2. Document types
    1. Computing standards, RFCs and guidelines
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User models

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 7
July 2023
422 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3610376
Editor:
Imed Zitouni
Google, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 July 2023
- Online AM: 24 May 2023
- Accepted: 19 May 2023
- Revised: 30 August 2022
- Received: 15 May 2021

Author Tags
Word tokenization
Tunisian Arabic
Arabic dialect
deep learning
SVM
CRF
Qualifiers
- research-article