Sentence boundary detection of various forms of Tunisian Arabic

Mekki, Asma; Zribi, Inès; Ellouze, Mariem; Belguith, Lamia Hadrich

doi:10.1007/s10579-021-09538-4

Published: 20 April 2021

Language Resources and Evaluation, Volume 56, pages 357–385 (2022)

Asma Mekki¹ (ORCID: orcid.org/0000-0003-3140-3171), Inès Zribi¹, Mariem Ellouze¹ & Lamia Hadrich Belguith¹

Abstract

Sentence boundary detection (SBD) is an essential step for many natural language processing applications, such as parsing, information retrieval, automatic summarization, and machine translation. In this paper, we tackle the problem of SBD for dialectal Arabic, focusing on the Tunisian dialect. We compare the performance of three learning algorithms, Deep Neural Networks (DNN), Support Vector Machines (SVM), and Conditional Random Fields (CRF), at detecting the boundaries of sentences written in different types of the dialect. The best model achieved an F-measure of 84.37% using CRF, a popular formalism for structured prediction in NLP that has been widely applied to text segmentation.
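To make the reported metric concrete, SBD can be framed as labeling each token as a sentence boundary or not, and the F-measure then follows from precision and recall over the boundary labels. The sketch below is illustrative only: the function name `boundary_f_measure`, the 0/1 label encoding, and the toy label sequences are assumptions, not the evaluation protocol or data used in the paper.

```python
from typing import List, Tuple


def boundary_f_measure(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for per-token boundary labels.

    Assumed encoding (hypothetical): 1 = token ends a sentence, 0 = otherwise.
    """
    if len(gold) != len(pred):
        raise ValueError("label sequences must have the same length")

    # Count agreements and disagreements on the boundary class only.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example: three gold boundaries, two of them found, one false alarm.
gold = [0, 0, 1, 0, 1, 0, 0, 1]
pred = [0, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f_measure = boundary_f_measure(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

Any of the three classifiers named in the abstract could supply `pred`; the scoring step is independent of the model.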

Notes

It is the informal Arabic chat alphabet.
https://www.ted.com/talks?language=en.
https://catalog.ldc.upenn.edu/LDC2011T11.
https://ar.wikiquote.org/wiki/أمثال_تونسية.
https://github.com/AsmaMekki/TA-Segmentation-Corpus.
TensorFlow is an end-to-end open-source deep learning framework developed and maintained by Google.

Author information

Authors and Affiliations

ANLP Research Group, MIRACL, University of Sfax, Sfax, Tunisia
Asma Mekki, Inès Zribi, Mariem Ellouze & Lamia Hadrich Belguith

Corresponding author

Correspondence to Asma Mekki.

About this article

Cite this article

Mekki, A., Zribi, I., Ellouze, M. et al. Sentence boundary detection of various forms of Tunisian Arabic. Lang Resources & Evaluation 56, 357–385 (2022). https://doi.org/10.1007/s10579-021-09538-4

Accepted: 09 March 2021
Published: 20 April 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10579-021-09538-4
