Abstract
In the Arabic-speaking world, text produced on social networks is often informal and typically written in one of several dialects, which may be transcribed in either Latin or Arabic characters. In Tunisia in particular, electronic writing largely mixes the Tunisian dialect with other languages, and users write with considerable individual freedom, unconstrained by orthographic or grammatical rules. In this work, we address the automatic identification of the Tunisian dialect in electronic writing produced on social networks using the Latin alphabet. We study and experiment with two different identification approaches. Our experiments show that the best performance is obtained with a machine-learning approach based on Support Vector Machines.
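To make the Support Vector Machine approach concrete, the following is a minimal sketch of word-level classification of romanized text. It is not the authors' implementation: the abstract does not state the features, kernel, or training procedure, so the character n-gram features, the Pegasos-style linear trainer, the function names, and the toy word list are all assumptions chosen for illustration.

```python
# Illustrative sketch only: features, trainer and data are assumptions,
# not details taken from the paper.

def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def featurize(word):
    """Map a word to a sparse, L2-normalised n-gram count vector."""
    counts = {}
    for gram in char_ngrams(word):
        counts[gram] = counts.get(gram, 0.0) + 1.0
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}


def train_linear_svm(samples, epochs=50, lam=0.01):
    """Pegasos-style sub-gradient descent on the regularised hinge loss.

    samples: list of (word, label) pairs, label +1 (Tunisian) or -1 (other).
    Samples are visited in a fixed order so the result is deterministic.
    """
    w = {}
    t = 0
    for _ in range(epochs):
        for word, y in samples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x = featurize(word)
            margin = y * sum(w.get(g, 0.0) * v for g, v in x.items())
            shrink = 1.0 - eta * lam  # effect of the L2 regulariser
            for g in w:
                w[g] *= shrink
            if margin < 1.0:  # sample violates the margin: hinge update
                for g, v in x.items():
                    w[g] = w.get(g, 0.0) + eta * y * v
    return w


def predict(w, word):
    """Return +1 (Tunisian) or -1 (other) for a single word."""
    x = featurize(word)
    score = sum(w.get(g, 0.0) * v for g, v in x.items())
    return 1 if score >= 0 else -1


# Hypothetical toy data: a few romanized Tunisian words (+1) and French words (-1).
TOY_SAMPLES = [
    ("barcha", 1), ("3lech", 1), ("chnowa", 1), ("behi", 1),
    ("bonjour", -1), ("merci", -1), ("demain", -1), ("toujours", -1),
]
```

Calling `train_linear_svm(TOY_SAMPLES)` returns a sparse weight dictionary, and `predict(w, word)` labels one word at a time, which matches the word-level framing of the task. A realistic system would need a much larger annotated corpus and proper evaluation.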
Notes
- 1. Arabic dialect written with the Latin alphabet.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Aridhi, C., Achour, H., Souissi, E., Younes, J. (2017). Word-Level Identification of Romanized Tunisian Dialect. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_19
DOI: https://doi.org/10.1007/978-3-319-59569-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science (R0)