Abstract
Arabic Dialects (ADs) have recently begun to receive more attention from the speech science and technology communities. Supporting dialects in language technologies will help improve the development process and the usability of applications such as speech recognition, speech comprehension, and speech synthesis. However, ADs suffer from a lack of resources compared to Modern Standard Arabic (MSA). This paper addresses the problem of tagging one AD, the Tunisian Dialect (TD). We present a method for building a fine-grained Part-Of-Speech (POS) tagger for the TD. The method consists of adapting an MSA POS tagger by generating a TD training corpus from an MSA corpus using a bilingual MSA-TD lexicon. Evaluated on a corpus of text transcriptions, the TD tagger achieved an accuracy of 78.5%.
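The corpus-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the bilingual lexicon can be modelled as a dictionary keyed by (MSA word, POS tag), that unknown words are left unchanged, and that tags carry over as-is. The function names `msa_to_td` and `accuracy`, the tag labels, and the placeholder tokens are all hypothetical.

```python
from typing import Dict, List, Tuple

# A tagged sentence is a list of (word, POS tag) pairs.
TaggedSentence = List[Tuple[str, str]]
# Hypothetical lexicon shape: (MSA word, POS tag) -> TD word.
Lexicon = Dict[Tuple[str, str], str]


def msa_to_td(sentence: TaggedSentence, lexicon: Lexicon) -> TaggedSentence:
    """Project a tagged MSA sentence into TD, keeping each POS tag.

    Words without a lexicon entry are copied unchanged (an assumption;
    the abstract does not say how such words are handled).
    """
    return [(lexicon.get((word, tag), word), tag) for word, tag in sentence]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Placeholder tokens only; these are not real MSA or TD words.
    toy_lexicon: Lexicon = {("msa_w1", "VERB"): "td_w1"}
    msa_sentence: TaggedSentence = [("msa_w1", "VERB"), ("msa_w2", "NOUN")]
    print(msa_to_td(msa_sentence, toy_lexicon))
    print(accuracy(["VERB", "NOUN"], ["VERB", "ADJ"]))
```

Keying the lexicon on both word and tag is one way to keep homographs with different categories apart. The generated sentences would then serve as training data for an existing tagger, and `accuracy` mirrors the token-level metric reported above.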
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Boujelbane, R., Mallek, M., Ellouze, M., Belguith, L.H. (2014). Fine-Grained POS Tagging of Spoken Tunisian Dialect Corpora. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_9
DOI: https://doi.org/10.1007/978-3-319-07983-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07982-0
Online ISBN: 978-3-319-07983-7
eBook Packages: Computer Science (R0)