Investigating the Effect of Preprocessing Arabic Text on Offensive Language and Hate Speech Detection

Authors:
Fatemah Husain, Kuwait University, Safat, Kuwait (ORCID: 0000-0003-3470-229X)
Ozlem Uzuner, George Mason University, Fairfax, VA, USA
ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4, Article No. 73, pp. 1–20. https://doi.org/10.1145/3501398

Published: 19 January 2022

Abstract

Preprocessing of input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and with a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply different types of classifiers in our experiments, including traditional machine learning, ensemble machine learning, Artificial Neural Networks, and Bidirectional Encoder Representations from Transformers (BERT)-based models. Our results demonstrate significant variations in the effects of preprocessing on each classifier type and on each dataset. BERT-based classifiers do not benefit from preprocessing, whereas traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
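Two of the techniques named in the abstract, basic cleaning and normalization of Arabic letter forms, can be illustrated with a short Python sketch. This is not the authors' implementation: the abstract does not list the exact characters or rules used, so the Unicode ranges, the letter mapping, and the function names `basic_clean` and `normalize_letters` below are assumptions chosen for illustration only.

```python
import re

# All patterns below are illustrative assumptions, not the paper's rules.
HTML_TAG = re.compile(r"<[^>]+>")                 # simple HTML tags
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tashkeel marks and superscript alef
KASHIDA = re.compile(r"\u0640")                    # tatweel (elongation character)
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")        # ASCII and Arabic-Indic digits
WHITESPACE = re.compile(r"\s+")


def basic_clean(text: str) -> str:
    """Remove HTML tags, diacritics, kashidas, and numbers, then tidy spaces."""
    text = HTML_TAG.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = KASHIDA.sub("", text)
    text = DIGITS.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


# One commonly used letter normalization scheme (assumed here):
# map variant forms of a letter to a single canonical form.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})


def normalize_letters(text: str) -> str:
    """Collapse variant Arabic letter forms using LETTER_MAP."""
    return text.translate(LETTER_MAP)


if __name__ == "__main__":
    raw = "<b>\u0645\u0631\u062D\u0628\u0640\u0640\u0627</b> 123"
    print(normalize_letters(basic_clean(raw)))
```

Applying `basic_clean` before `normalize_letters` keeps the two steps independent, which makes it possible to switch each one on or off when comparing their separate effects on a classifier.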

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 4
July 2022
464 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3511099
Editor: Imed Zitouni, Google, USA
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 January 2022
- Accepted: 1 November 2021
- Revised: 1 October 2021
- Received: 1 December 2020
Author Tags
Artificial neural networks
offensive language detection
natural language processing
Arabic language
machine learning
Qualifiers
- research-article
- Refereed
