research-article

Investigating the Effect of Preprocessing Arabic Text on Offensive Language and Hate Speech Detection

Published: 19 January 2022

Abstract

Preprocessing of input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply several types of classifiers: traditional machine learning, ensemble machine learning, artificial neural networks, and models based on Bidirectional Encoder Representations from Transformers (BERT). Our results show that the effects of preprocessing vary significantly across classifier types and datasets. BERT-based classifiers do not benefit from preprocessing, while traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
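To make three of the techniques named in the abstract concrete (basic cleaning, letter normalization, and hashtag segmentation), here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rules, so the character ranges, the letter mappings, the underscore-based hashtag splitting, and all function names below are assumptions chosen for illustration.

```python
import re

# All patterns and mappings below are illustrative assumptions;
# the abstract does not specify the paper's exact rules.
HTML_TAG = re.compile(r"<[^>]+>")
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")        # Western and Arabic-Indic digits
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # fathatan..sukun, superscript alef
KASHIDA = "\u0640"                                 # tatweel (elongation character)

# One common (assumed) normalization of letter variants.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def basic_clean(text):
    """Remove HTML tags, numbers, diacritics, and kashidas, then collapse spaces."""
    text = HTML_TAG.sub(" ", text)
    text = DIGITS.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    return " ".join(text.split())


def normalize_letters(text):
    """Map variant letter forms to a single canonical form."""
    return text.translate(LETTER_MAP)


def segment_hashtag(text):
    """Drop the '#' and split the tag on underscores (a simplifying assumption)."""
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)


def preprocess(text):
    """Apply the three sketched steps in one fixed (assumed) order."""
    return normalize_letters(basic_clean(segment_hashtag(text)))
```

Calling `preprocess` on a raw tweet applies the three steps together; calling the functions individually allows each technique to be tested in isolation, which is the kind of comparison the abstract describes.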

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4
      July 2022
      464 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3511099
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 January 2022
      • Accepted: 1 November 2021
      • Revised: 1 October 2021
      • Received: 1 December 2020
      Qualifiers

      • research-article
      • Refereed
