Abstract
Tokenization is the process of segmenting a piece of text into smaller units called tokens. Because Arabic is an agglutinative language, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task combined with a rewriting process that rewrites the orthography of the stem. We carry out this task on Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study of word tokenization for TA. We therefore start by collecting and preparing various TA corpora from different sources. We then compare three character-based tokenizers built on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best proposed model, based on CRF, achieved an F-measure of 88.9%.
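To make the idea of a character-based tokenizer concrete, the following minimal sketch shows one common way to cast word tokenization as character-level sequence labeling, where each character is tagged as beginning (B) or continuing (I) a token. The B/I scheme, the function names, the context-window features, and the Buckwalter-style example word are all assumptions chosen for illustration; they are not taken from the article, and the stem-rewriting step is not modeled here.

```python
from typing import Dict, List, Tuple


def chars_to_labels(segmented: str, sep: str = "+") -> Tuple[List[str], List[str]]:
    """Convert a gold segmentation such as 'w+b+ktAb+hm' into characters
    and per-character B/I labels (B = first character of a token)."""
    chars: List[str] = []
    labels: List[str] = []
    for token in segmented.split(sep):
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_tokens(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild tokens from characters and predicted B/I labels."""
    tokens: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not tokens:
            tokens.append(ch)      # start a new token
        else:
            tokens[-1] += ch       # extend the current token
    return tokens


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Hypothetical feature set for position i: the character itself plus
    its neighbours within a small window, padded at the word edges."""
    feats = {"char": chars[i]}
    for off in range(1, window + 1):
        feats[f"char-{off}"] = chars[i - off] if i - off >= 0 else "<s>"
        feats[f"char+{off}"] = chars[i + off] if i + off < len(chars) else "</s>"
    return feats


# Illustrative word in Buckwalter-style transliteration (not from the TA corpora).
chars, labels = chars_to_labels("w+b+ktAb+hm")
tokens = labels_to_tokens(chars, labels)
```

Feature dictionaries of this kind, one per character, are the sort of input a CRF or SVM sequence labeler could be trained on, while a neural model would typically learn character representations instead of using hand-built features.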
Index Terms
- Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models