Word-Level Identification of Romanized Tunisian Dialect

Aridhi, Chaima; Achour, Hadhemi; Souissi, Emna; Younes, Jihene

doi:10.1007/978-3-319-59569-6_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10260)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

In the Arabic-speaking world, text produced on social networks is often informal and generally characterized by the use of various dialects, which can be transcribed in Latin or Arabic characters. More specifically, electronic writing in Tunisia is largely characterized by a mixture of Tunisian dialect with other languages and by a degree of individualization that leaves users free to write without orthographic or grammatical constraints. In this work, we address the problem of automatically identifying Tunisian dialect in electronic writing produced on social networks using the Latin alphabet. We study and experiment with two different identification approaches. Our experiments show that the best performance is obtained with a machine-learning approach based on Support Vector Machines.

Notes

1. Arabic dialect written with the Latin alphabet.
2. https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html

Author information

Authors and Affiliations

Université de Tunis, ENSIT, 1008, Montfleury, Tunisia
Chaima Aridhi & Emna Souissi
Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000, Le Bardo, Tunisia
Hadhemi Achour & Jihene Younes

Corresponding author

Correspondence to Chaima Aridhi.

Editor information

Editors and Affiliations

Erasmus University Rotterdam, Rotterdam, The Netherlands
Flavius Frasincar
University of Liège, Liège, Belgium
Ashwin Ittoo
Japan Advanced Institute of Science and Technology, Nomi, Japan
Le Minh Nguyen
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Aridhi, C., Achour, H., Souissi, E., Younes, J. (2017). Word-Level Identification of Romanized Tunisian Dialect. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_19

DOI: https://doi.org/10.1007/978-3-319-59569-6_19
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science, Computer Science (R0)
