Linguistic Resources Construction: Towards Disfluency Processing in Spontaneous Tunisian Dialect Speech

  • Conference paper
Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

The Tunisian Dialect (TD) is an under-resourced language that lacks both corpora and Natural Language Processing (NLP) tools, despite its growing use in spoken and written forms. In this paper, we present our effort to build linguistic resources for TD in order to process disfluencies. First, we created the Disfluencies Corpus from Tunisian Arabic Transcriptions (DisCoTAT), a set of manual transcriptions containing several disfluency phenomena. We also constructed the Tunisian Dialect Wordnet (TD-WordNet) from existing TD lexicons to annotate words with morpho-syntactic tags. Then, we developed the Disfluency Annotation Tool (DisAnT) to annotate DisCoTAT. DisAnT provides two levels of annotation: morpho-syntactic tagging and disfluency annotation.
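
To make the idea of two annotation levels concrete, here is a minimal Python sketch of what a two-level token record might look like. It is an illustration only, not DisAnT's actual data model: the `Token` class, the tag names (`NOUN`, `VERB`, `UNK`), the `repetition` label, the lexicon lookup, and the placeholder word forms `w1`, `w2`, `w3` are all assumptions introduced here, and the adjacent-repetition rule is a toy stand-in for real disfluency annotation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str                          # transliterated word form
    pos: Optional[str] = None          # level 1: morpho-syntactic tag
    disfluency: Optional[str] = None   # level 2: disfluency label (None if fluent)


def tag_pos(tokens: List[Token], lexicon: Dict[str, str]) -> None:
    """Level 1: look each form up in a lexicon; unknown forms get 'UNK'."""
    for tok in tokens:
        tok.pos = lexicon.get(tok.form, "UNK")


def mark_repetitions(tokens: List[Token]) -> None:
    """Level 2 (toy rule): flag a token whose form is immediately repeated."""
    for cur, nxt in zip(tokens, tokens[1:]):
        if cur.form == nxt.form:
            cur.disfluency = "repetition"


def annotate(words: List[str], lexicon: Dict[str, str]) -> List[Token]:
    """Run both annotation levels over a list of word forms."""
    tokens = [Token(w) for w in words]
    tag_pos(tokens, lexicon)
    mark_repetitions(tokens)
    return tokens


# Hypothetical input: placeholder forms and a two-entry lexicon.
tokens = annotate(["w1", "w1", "w2", "w3"], {"w1": "NOUN", "w2": "VERB"})
for tok in tokens:
    print(tok.form, tok.pos, tok.disfluency)
```

Keeping the two levels as separate fields on one record means either level can be filled in or corrected independently, which is one plausible reason to separate them in an annotation tool.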

Notes

  1. We used the Buckwalter transliteration.

  2. https://www.fichier-pdf.fr/2010/08/31/m14401m/dico-karmous.pdf

  3. http://www.arabetunisien.com/

  4. https://files.eric.ed.gov/fulltext/ED183017.pdf

  5. https://fieldsupport.dliflc.edu/productList.aspx?v=lsk

  6. https://www.happyscribe.co/

  7. A city located in the center of Tunisia.
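
For readers unfamiliar with the Buckwalter transliteration mentioned in the first note, the sketch below shows how the scheme maps Arabic script to ASCII one character at a time. It covers only the core letters and diacritics (U+0621 to U+063A and U+0640 to U+0652) and leaves every other character unchanged; the function name `to_buckwalter` is an assumption introduced here, and this is not the authors' own tooling.

```python
# Buckwalter symbols for two contiguous Unicode ranges of the Arabic block.
LETTERS_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 (hamza) .. U+063A (ghain)
LETTERS_2 = "_fqklmnhwYyFNKaui~o"          # U+0640 (tatweel) .. U+0652 (sukun)

BUCKWALTER = {chr(0x0621 + i): c for i, c in enumerate(LETTERS_1)}
BUCKWALTER.update({chr(0x0640 + i): c for i, c in enumerate(LETTERS_2)})


def to_buckwalter(text: str) -> str:
    """Replace each mapped Arabic character; pass everything else through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


# kaf, teh, alef, beh
print(to_buckwalter("\u0643\u062A\u0627\u0628"))  # prints "ktAb"
```

Because the mapping is one-to-one, the same table can be inverted to recover the Arabic script from a transliterated transcription.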

Author information

Corresponding author

Correspondence to Emna Boughariou.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Boughariou, E., Bahou, Y., Bleguith, L.H. (2019). Linguistic Resources Construction: Towards Disfluency Processing in Spontaneous Tunisian Dialect Speech. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_27

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
