Sentence boundary detection of various forms of Tunisian Arabic

Language Resources and Evaluation

Abstract

Sentence boundary detection (SBD) is an essential step for many natural language processing applications, such as parsing, information retrieval, automatic summarization, and machine translation. In this paper, we tackle the problem of SBD for dialectal Arabic, focusing on the Tunisian dialect. We compare the efficiency of three learning algorithms, Deep Neural Networks (DNN), Support Vector Machines (SVM), and Conditional Random Fields (CRF), for detecting the boundaries of sentences written in different forms of the dialect. The best model, a CRF, achieved an F-measure of 84.37%. CRFs are a popular formalism for structured prediction in NLP and have been widely applied to text segmentation.
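The abstract reports results as an F-measure but does not spell out the evaluation setup. As a minimal sketch, assuming SBD is scored per token with a binary label (1 if the token ends a sentence, 0 otherwise), the F-measure for the boundary class could be computed as below. The function name and the toy labels are hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence, Tuple


def boundary_f_measure(
    gold: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for the boundary class (label 1)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count true positives, false positives and false negatives for label 1.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical toy labels: 1 = token ends a sentence, 0 = it does not.
gold = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f_measure = boundary_f_measure(gold, predicted)
```

Scoring only the boundary class, rather than overall token accuracy, avoids inflating the result with the far more frequent non-boundary tokens. Whether the paper scores boundaries this way is an assumption.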

Notes

  1. It is the informal Arabic chat alphabet.

  2. https://www.ted.com/talks?language=en.

  3. https://catalog.ldc.upenn.edu/LDC2011T11.

  4. https://ar.wikiquote.org/wiki/أمثال_تونسية.

  5. https://github.com/AsmaMekki/TA-Segmentation-Corpus.

  6. TensorFlow is an end-to-end open-source deep learning framework developed and maintained by Google.

Author information

Corresponding author

Correspondence to Asma Mekki.

About this article

Cite this article

Mekki, A., Zribi, I., Ellouze, M. et al. Sentence boundary detection of various forms of Tunisian Arabic. Lang Resources & Evaluation 56, 357–385 (2022). https://doi.org/10.1007/s10579-021-09538-4

  • DOI: https://doi.org/10.1007/s10579-021-09538-4