OMCD: Offensive Moroccan Comments Dataset

Essefar, Kabil; Ait Baha, Hassan; El Mahdaouy, Abdelkader; El Mekki, Abdellah; Berrada, Ismail

doi:10.1007/s10579-023-09663-2

Original Paper
Published: 05 June 2023

Language Resources and Evaluation, Volume 57, pages 1745–1765 (2023)

Kabil Essefar¹ (ORCID: orcid.org/0000-0002-5352-1550),
Hassan Ait Baha¹,
Abdelkader El Mahdaouy²,
Abdellah El Mekki¹ &
Ismail Berrada¹

Abstract

Offensive content, such as verbal attacks, demeaning comments, or hate speech, has become widespread on social media. Automatic detection of this content is considered an important and challenging task. Although several research works have addressed this challenge for high-resource languages, the detection of offensive content in Dialectal Arabic (DA) remains under-explored. Recently, it has gained increasing interest among researchers in Natural Language Processing (NLP). However, only a limited number of annotated datasets have been introduced for single or multiple coarse-grained dialects. In this paper, we introduce the Offensive Moroccan Comments Dataset (OMCD), the first dataset for offensive language detection in the Moroccan dialect. First, we present the data collection steps, the statistical analysis, and the annotation guidelines of the introduced dataset. Then, we evaluate several state-of-the-art Machine Learning (ML) and Deep Learning (DL) based models on OMCD. Finally, we highlight the impact of emojis on the evaluated models for offensive language detection.
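One way to study the impact of emojis is to give the same model each comment with and without its emojis. The following is a minimal sketch under stated assumptions, not the authors' preprocessing: the Unicode ranges are an approximate, non-exhaustive choice, and the names `strip_emojis` and `emoji_variants` are hypothetical helpers introduced only for illustration. Zero-width joiners are deliberately left untouched, since they can also occur in Arabic-script text.

```python
import re

# Assumption: these Unicode blocks cover most common emojis; the list is
# approximate and not taken from the paper.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove emoji characters and collapse the leftover whitespace."""
    without = EMOJI_PATTERN.sub(" ", text)
    return " ".join(without.split())


def emoji_variants(comment: str) -> dict:
    """Return both versions of a comment, ready to feed to the same model."""
    return {
        "with_emojis": comment,
        "without_emojis": strip_emojis(comment),
    }


if __name__ == "__main__":
    # Hypothetical example comment, not drawn from OMCD.
    print(emoji_variants("good 😀 video"))
```

Training or evaluating a classifier on each variant and comparing the scores then gives a rough estimate of how much signal the emojis carry.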

Notes

https://socialblade.com/.
The dataset is publicly available at: https://github.com/kabilessefar/OMCD-Offensive-Moroccan-Comments-Dataset.

Author information

Authors and Affiliations

School of Computer Sciences, Mohammed VI Polytechnic University, Ben Guerir, Morocco
Kabil Essefar, Hassan Ait Baha, Abdellah El Mekki & Ismail Berrada
Modeling, Simulation and Data Analysis (MSDA), Mohammed VI Polytechnic University, Ben Guerir, Morocco
Abdelkader El Mahdaouy

Corresponding author

Correspondence to Kabil Essefar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Essefar, K., Ait Baha, H., El Mahdaouy, A. et al. OMCD: Offensive Moroccan Comments Dataset. Lang Resources & Evaluation 57, 1745–1765 (2023). https://doi.org/10.1007/s10579-023-09663-2

Accepted: 26 April 2023
Published: 05 June 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10579-023-09663-2
