Automatic Processing of Algerian Dialect: Corpus Construction and Segmentation

  • Original Research
  • SN Computer Science

Abstract

Arabic is the official language of multiple countries in Asia and Africa. It is used in all aspects of written and spoken communication, including newspapers, magazines, television, administrative correspondence, and schools. Owing to differences in culture and history, Arabic dialects are rich and diverse, with each country having its own dialect(s). This diversity, however, makes it difficult to automate the processing of their linguistic and grammatical rules, and the Algerian dialect poses the most difficulties. In this paper, we explore the challenges that hinder the automatic processing of the Algerian dialect. We also present linguistic resources for its automatic processing, starting from a written corpus. This corpus goes through multiple stages, chiefly the segmentation of the written texts using grammar rules extracted from the corpus and the translation of some rules from other dialects.

Author information

Corresponding author

Correspondence to Abdelhakim Benali.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Benali, A., Maaloul, M.H. & Belguith, L.H. Automatic Processing of Algerian Dialect: Corpus Construction and Segmentation. SN COMPUT. SCI. 4, 597 (2023). https://doi.org/10.1007/s42979-023-02097-1

  • DOI: https://doi.org/10.1007/s42979-023-02097-1
