Abusive and Hate speech Classification in Arabic Text Using Pre-trained Language Models and Data Augmentation

Authors:

Nabil Badri,

Ferihane Kboubi,

Anja Habacha Chaibi

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 11

Article No.: 155, Pages 1 - 28

https://doi.org/10.1145/3679049

Published: 21 November 2024

Abstract

Hateful content on social media is a worldwide problem that adversely affects not just the targeted individuals but also anyone to whom the content is accessible. Most studies on the automatic identification of inappropriate content have addressed English, given the availability of resources. As a result, a number of low-resource languages still need more attention from the community. This article focuses on Arabic dialects, whose specific characteristics make the use of non-Arabic models inappropriate. Our hypothesis is that leveraging pre-trained language models (PLMs) specifically designed for Arabic, along with data augmentation techniques, can significantly enhance the detection of hate speech in Arabic mono- and multi-dialect texts.

To test this hypothesis, we conducted a series of experiments addressing three key research questions: (RQ1) Does text augmentation enhance the final results compared to using an unaugmented dataset? (RQ2) Do Arabic PLMs outperform other models utilizing techniques such as fastText and AraVec word embeddings? (RQ3) Does training and fine-tuning models on a multilingual dataset yield better results than training them on a monolingual dataset?

Our methodology involved the comparison of PLMs based on transfer learning, specifically examining the performance of DziriBERT, AraBERT v2, and BERT-base-arabic models. We implemented text augmentation techniques and evaluated their impact on model performance. The tools used included fastText and AraVec for word embeddings, as well as various PLMs for transfer learning.
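The abstract does not name the augmentation techniques that were applied, so the following is only a minimal sketch of one common family: token-level random swap and random deletion. The function names, the deletion probability, and the placeholder tokens are all assumptions made for illustration and are not taken from the article.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(text, label, n_copies=2, seed=0):
    """Return the original (text, label) pair plus n_copies perturbed copies.

    Hypothetical helper: alternates between swap and deletion, and keeps
    the label unchanged for every generated copy.
    """
    rng = random.Random(seed)
    tokens = text.split()  # whitespace tokenization, an assumption
    samples = [(text, label)]
    for k in range(n_copies):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, 1, rng)
        else:
            new_tokens = random_deletion(tokens, 0.1, rng)
        samples.append((" ".join(new_tokens), label))
    return samples


if __name__ == "__main__":
    # Placeholder tokens stand in for a real labelled post.
    for sample in augment("w1 w2 w3 w4 w5", label=1, n_copies=3):
        print(sample)
```

Perturbations of this kind keep the label fixed, which is only reasonable when a small change in word order or a dropped token is unlikely to alter whether a post is abusive.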

The results demonstrate a notable improvement in classification accuracy, with augmented datasets showing an increase in performance metrics (accuracy, precision, recall, and F1-score) by up to 15–21% compared to non-augmented datasets. This underscores the potential of data augmentation in enhancing the models’ ability to generalize across the nuanced spectrum of Arabic dialects.
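For reference, the four reported metrics can be computed from predictions as sketched below. This is a binary-case illustration only: the abstract does not state how the scores were averaged across classes, and the function name and label encoding (1 for the hateful class) are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, not data from the article.
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Running the same function on predictions from a model trained with and without augmentation gives the kind of side-by-side comparison that the reported improvement refers to.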
