Abstract
The Tunisian Dialect (TD) is an under-resourced language that lacks both corpora and Natural Language Processing (NLP) tools, despite being increasingly used in spoken and written forms. In this paper, we present our endeavour to build linguistic resources for TD in order to process disfluencies. First, we created the Disfluencies Corpus from Tunisian Arabic Transcriptions (DisCoTAT), a set of manual transcriptions containing several disfluency phenomena. Second, we constructed the Tunisian Dialect Wordnet (TD-WordNet) from existing TD lexicons to annotate words with morpho-syntactic tags. Finally, we developed the Disfluency Annotation Tool (DisAnT) to annotate DisCoTAT. DisAnT provides two levels of annotation: morpho-syntactic tagging and disfluency annotation.
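To make the two annotation levels concrete, the following is a minimal sketch of how a token could carry both a morpho-syntactic tag and a disfluency label. It is not DisAnT's implementation: the `Token` class, the function names, the tag names (`NOUN`, `UNK`, `REPETITION`), the placeholder word forms, and the toy repetition rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """One transcribed word carrying both (hypothetical) annotation levels."""
    form: str                         # transcribed word form
    pos: Optional[str] = None         # level 1: morpho-syntactic tag
    disfluency: Optional[str] = None  # level 2: disfluency label (None = fluent)


def tag_morphosyntax(tokens: List[Token], lexicon: Dict[str, str]) -> List[Token]:
    """Level 1: look each form up in a form -> tag lexicon (stand-in for a wordnet)."""
    for tok in tokens:
        tok.pos = lexicon.get(tok.form, "UNK")  # unknown words get a fallback tag
    return tokens


def mark_repetitions(tokens: List[Token]) -> List[Token]:
    """Level 2 (toy rule): flag a word that is immediately repeated."""
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.form == cur.form:
            prev.disfluency = "REPETITION"  # the first copy is the disfluent one
    return tokens


# Placeholder forms "w1", "w2" stand in for real transcribed words.
tokens = [Token("w1"), Token("w1"), Token("w2")]
mark_repetitions(tag_morphosyntax(tokens, {"w1": "NOUN"}))
for tok in tokens:
    print(tok)
```

Keeping the two levels as separate fields on the same token means either pass can be run or corrected independently, which mirrors the two-level description above without claiming anything about how the actual tool stores its annotations.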
Notes
- 1. We used the Buckwalter transliteration.
- 7. A city located in the center of Tunisia.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Boughariou, E., Bahou, Y., Bleguith, L.H. (2019). Linguistic Resources Construction: Towards Disfluency Processing in Spontaneous Tunisian Dialect Speech. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_27
DOI: https://doi.org/10.1007/978-3-030-27947-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)