Abstract
Language identification is an NLP task that aims to predict the language of a given text. Many attempts have been made to address this task for Arabic dialects. In this paper, we present our approach to building a language identification system that distinguishes between Moroccan Colloquial Arabic and Arabic using two different methods. The first is rule-based and relies on stop-word frequency, while the second is statistical and uses several machine learning classifiers. The results show that the statistical approach outperforms the rule-based approach and that, among the statistical classifiers, the Support Vector Machines classifier is the most accurate. Our goal in this paper is to pave the way toward building advanced Moroccan dialect NLP tools such as a morphological analyzer and a machine translation system.
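As a rough illustration of the rule-based idea, the sketch below counts how many tokens of a sentence appear in a stop-word list for each variety and picks the variety with the higher count. The two word lists, the function name `identify`, the labels `MCA` and `MSA`, and the tie-handling rule are assumptions made for this example; they are not the lists or decision rules used in the paper.

```python
from collections import Counter

# Tiny illustrative stop-word lists (assumptions for this sketch only;
# a real system would use much larger, curated lists).
MSA_STOP_WORDS = {"في", "من", "على", "هذا", "التي", "الذي"}
MCA_STOP_WORDS = {"ديال", "بزاف", "واش", "هاد", "كاين", "ماشي"}


def identify(text: str) -> str:
    """Label a text as 'MCA', 'MSA', or 'unknown' by stop-word frequency."""
    counts = Counter()
    # Whitespace tokenization is a simplification; no normalization is applied.
    for token in text.split():
        if token in MCA_STOP_WORDS:
            counts["MCA"] += 1
        if token in MSA_STOP_WORDS:
            counts["MSA"] += 1
    # With equal counts (including zero matches) the rule cannot decide.
    if counts["MCA"] == counts["MSA"]:
        return "unknown"
    return "MCA" if counts["MCA"] > counts["MSA"] else "MSA"
```

A sentence dominated by dialectal function words, such as `identify("واش كاين بزاف ديال الناس")`, is labeled `MCA` under these lists, while a sentence with no matches in either list stays `unknown`.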
Notes
- 1. “World Arabic Language Day | United Nations Educational, Scientific and Cultural Organization”. www.unesco.org.
- 4. Hidden Markov Model.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Tachicart, R., Bouzoubaa, K., Aouragh, S.L., Jaafa, H. (2018). Automatic Identification of Moroccan Colloquial Arabic. In: Lachkar, A., Bouzoubaa, K., Mazroui, A., Hamdani, A., Lekhouaja, A. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2017. Communications in Computer and Information Science, vol 782. Springer, Cham. https://doi.org/10.1007/978-3-319-73500-9_15
DOI: https://doi.org/10.1007/978-3-319-73500-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73499-6
Online ISBN: 978-3-319-73500-9
eBook Packages: Computer Science, Computer Science (R0)