Abstract
Preprocessing of input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and with a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply several types of classifiers: traditional machine learning, ensemble machine learning, artificial neural networks, and models based on Bidirectional Encoder Representations from Transformers (BERT). Our results demonstrate significant variations in the effects of preprocessing on each classifier type and on each dataset. BERT-based classifiers do not benefit from preprocessing, while traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
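As an illustration only, two of the techniques named above (basic cleaning and letter normalization) might be sketched as follows. This is not the pipeline used in the study: the function names, the regular expressions, and the specific letter mappings (alef variants to bare alef, alef maqsura to ya, ta marbuta to ha) are assumptions chosen because they are common conventions in Arabic text normalization.

```python
import re

# Hypothetical patterns; the exact rules used in the study are not given in the abstract.
HTML_TAG = re.compile(r"<[^>]+>")                # anything that looks like an HTML tag
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")      # ASCII and Arabic-Indic digits
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanwin, short vowels, shadda, sukun, dagger alef
KASHIDA = "\u0640"                               # tatweel (elongation character)

# Assumed letter normalization table (a common convention, not confirmed by the source).
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def basic_clean(text: str) -> str:
    """Remove HTML tags, numbers, diacritics, and kashidas, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = DIGITS.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    return " ".join(text.split())


def normalize_letters(text: str) -> str:
    """Map variant forms of Arabic letters to a single canonical form."""
    return text.translate(LETTER_MAP)


if __name__ == "__main__":
    sample = "<b>\u0645\u0631\u062D\u0640\u0640\u0628\u064B\u0627</b> 123"
    print(normalize_letters(basic_clean(sample)))
```

Keeping each technique in its own function makes it straightforward to apply them individually or in combination, which mirrors the experimental setup described above (each technique alone, raw text, and all techniques together).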