1 Introduction
Sexism refers to hatred, prejudice, or stereotyping directed at a person on the basis of their sex [1]. In an age where online activities such as browsing and social media occupy much of our time, the presence of sexism online has become a pressing issue. With the rise of social platforms that allow anonymous and unverified accounts, sexist comments have proliferated. Various platforms have tried to curb the problem with content filters and blockers [26]. However, these systems have performed poorly, and in several cases ordinary texts were misclassified as sexist. Detecting sexist texts on social media is challenging due to inconsistent wording and mixed sarcasm. While sexist comments may be considered a form of hate speech, sexism has a broader scope, and research on sexism detection is extensive [
17]. In the fight against online sexism, identifying and categorizing such content into relevant sub-types is essential for moderation efforts. Sexism can manifest in various forms, such as derogatory language, stereotyping, or objectification, and these different types of sexist comments may require distinct responses [
9]. Therefore, developing a system that detects sexist content and classifies it into multiple categories allows for a more nuanced understanding and response. Multiclass classification has become increasingly feasible with the rise of transformer-based models, such as BERT, RoBERTa, and GPT, which can capture context and subtleties in language better than traditional machine learning algorithms.
While transformer-based models have shown state-of-the-art performance on various natural language processing tasks, including text classification, their black-box nature remains a critical limitation [
3]. In particular, users and moderators need more than just accurate predictions; they require insight into why certain comments are classified as sexist. Explainable AI (XAI) tools, such as LIME (Local Interpretable Model-agnostic Explanations), have been introduced to fill this gap by highlighting the specific words or phrases that influenced the model’s decision. This interpretability is crucial not only for trust in automated systems but also for understanding and addressing the deeper societal and cultural biases embedded in online discourse. At the same time, studies show that AI systems trained on large datasets can inherit biases present in the data, leading to biased classifications [
23]. By utilizing LIME, this study also seeks to uncover potential gender bias in transformer-based models. Specifically, it aims to understand whether similar sexist comments directed at men and women are treated differently by the model, revealing underlying discriminatory tendencies.
In this paper, we present a comprehensive approach to automatic detection and multi-class classification of online sexist comments using transformer-based large language models. The contribution of this research includes:
(1)
We perform binary, multi-class, and fine-grained vector classification of sexist comments using different transformer-based large language models.
(2)
We analyze the model’s predictions using LIME explainable AI to identify which parts of a sentence the model considers sexist.
(3)
We expose potential gender discrimination in the classification by applying the framework to an independent test dataset.
2 Related Work
This section gives an overview of recent progress in natural language processing for sexism detection. Different approaches have been taken to identify racist slurs, derogatory comments, and hate speech in text. Earlier work relied on rule-based methods that classified texts according to patterns of abusive and non-abusive words [11]. Owing to hardware limitations, these approaches outperformed machine learning algorithms at the time. Even so, a 2012 study showed exceptional results using simple linear regression methods [31] to classify Twitter texts. With the gradual increase in computational power, more complex algorithms were introduced. Decision trees [
6], logistic regression [
30], support vector machines [
8], and many other statistical machine learning approaches performed well over the years. These models mainly used a bag-of-words representation under the hood [
6,
30]. Later, n-gram feature engineering was introduced for text classification tasks and was used alongside the TF-IDF method [
10].
In addition, deep learning methods later surged once powerful GPUs became widely available. Recurrent neural networks (RNNs) [
27], Long Short-Term Memory (LSTM) [
29], and Bi-directional LSTMs [
33] have been the pioneering deep learning architectures for text classification. Their ability to model and remember sequences was groundbreaking across text-understanding tasks. Everything changed with the introduction of transformers and the attention mechanism by [28] in 2017. This enabled pre-training and transfer learning and made models flexible and modular. BERT was one of the first large language models to harness the power of transformers [7]. Since then, nearly all recent natural language processing tasks have involved transformers to some degree. Almost all submissions to SemEval-2023 Task 10, which focuses on the classification of online sexism, used some variant of BERT or another transformer model. These teams employed various pre-trained models, data augmentation techniques, and explainability methods, achieving varying degrees of success across Tasks A (binary sexism detection), B (category classification), and C (fine-grained classification).
Additionally, Nakwijit et al. leveraged BERT, BERTweet, and TwHIN-BERT, forming a sexism lexicon using SHAP values [
18]. This method incorporated a bag-of-words approach for enhanced interpretability. Their performance was moderate, with F1 scores of 0.578 (Task B) and 0.262 (Task C), showcasing a trade-off between explainability and accuracy. Segura et al. used a diverse set of transformers, including BERT, DistilBERT, RoBERTa, and XLNet [25]. Although they explored data augmentation, it degraded performance. RoBERTa performed best on Task A with an F1 score of 0.832, while BERT achieved 0.599 on Task B, and XLNet 0.417 on Task C. Chang et al. implemented DeBERTa-v3-large, XLM-RoBERTa-large, and covidtwitter-BERT-v2 [
4], achieving strong results in Task A (an F1 score of 0.8433 with DeBERTa), though they participated only in this task. They pre-trained their models on the unlabeled GAB dataset. Hemati et al. introduced a random adversarial training layer, utilizing RoBERTa-Large and DeRoBERTa [
13]. Their model achieved 0.845 on Task A, 0.678 on Task B, and 0.525 on Task C, indicating robust overall performance without explainability. Pu and Zhou implemented an ensemble of RoBERTa-BiLSTM-BiGRU and SimCSE-RoBERTa, excelling in Task A with a high F1 score of 0.85 [
22].
Moreover, Guo et al. combined BERT and BiLSTM, attaining a modest 0.789 for Task A and enhancing the model with datasets from EXIST-22 and TRAC-22 for data augmentation [
12]. Obeidat et al. achieved competitive scores with an ensemble of BERT and RoBERTa [
19], performing well in all tasks: 0.8538 in Task A, 0.6417 in Task B, and 0.4774 in Task C. Das et al. fine-tuned RoBERTa [
5], achieving 0.8364 on Task A, 0.6588 on Task B, and 0.3320 on Task C. Overall, the use of ensemble models and transformer architectures such as RoBERTa and DeBERTa led to superior performance, especially in Task A. While SHAP-based explainability provided insights into the models’ decision-making, it often came at the cost of reduced performance, as seen in Nakwijit et al.’s method. Moreover, data augmentation techniques sometimes degraded the models’ results, as evidenced by Segura et al.’s experiments.
To our knowledge, no prior work has used explainable AI to examine gender bias in sexist-comment classifiers in a similar setting.
4 Experimental Evaluation
We conducted the experiments on a laptop equipped with an NVIDIA RTX 3060 GPU (6 GB of video memory), an AMD Ryzen 7 5700X CPU, and 16 GB of RAM. PyTorch was the deep-learning backbone [21]. We used accuracy, precision, recall, and F1-score to evaluate the classification performance of the different models. We also inspected the confusion matrix on the test set to get an overview of the type I and type II errors made during classification.
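As a minimal sketch (not the exact evaluation script used in this study), these metrics and the confusion matrix can be computed with scikit-learn; the labels below are toy placeholders standing in for the test labels and model predictions.

```python
# Minimal sketch of the evaluation metrics described above; y_true and y_pred
# are toy placeholders, not the actual test-set labels and predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred)  # off-diagonal cells are type I/II errors

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(cm)
```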
4.1 Classification Results
The study reveals that as the number of classes increases, the models’ overall accuracy tends to decrease, a common challenge in multi-class classification problems with many classes. The varying performance across the classes in Tasks B and C points to imbalances in the training data. Table 2 summarizes the performance metrics of the three tasks for four models: BERT, RoBERTa, DistilBERT, and SqueezeBERT. In Task A, RoBERTa outperforms the other models with the highest accuracy, precision, recall, and F1-score. BERT follows closely with an accuracy of 86.8%. DistilBERT and SqueezeBERT perform similarly, with slightly lower scores (around 85%). Task B’s performance is lower than Task A’s, with RoBERTa leading at 60.98% accuracy and an F1-score of 57.56%. DistilBERT and SqueezeBERT show reduced performance, with SqueezeBERT scoring the lowest. Lastly, all models struggle more with Task C. RoBERTa maintains the highest performance with an accuracy of 52.75% and an F1-score of 49%. BERT follows with 51.37% accuracy and a lower F1-score of 45.3%. DistilBERT and SqueezeBERT perform significantly worse, with SqueezeBERT showing the lowest accuracy (46.08%) and a very poor F1-score (35.78%).
Overall, RoBERTa consistently outperforms the other models across all tasks, while DistilBERT and SqueezeBERT offer more lightweight but less accurate alternatives, struggling in particular with the more complex multi-class tasks. In the following subsections, we present the outputs of the RoBERTa model for each task.
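For illustration, a minimal fine-tuning sketch for one such model is shown below; the checkpoint name, file name, and hyperparameters are assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch of fine-tuning RoBERTa for Task A (binary sexism detection).
# The checkpoint, data file, and hyperparameters are illustrative assumptions.
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2).to(device)

df = pd.read_csv("task_a_train.csv")  # hypothetical file with "text" and "label" columns
enc = tokenizer(list(df["text"]), truncation=True, padding=True,
                max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(df["label"].values))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a small number of epochs is typical for fine-tuning
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()  # cross-entropy loss from the classification head
        optimizer.step()
```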
4.2 Task A
In Task A, we have a binary classification scenario, distinguishing between sexist and non-sexist comments. The model demonstrates strong performance, accurately identifying both categories with high precision. It shows a slight edge in recognizing non-sexist content, but overall the classification is balanced and effective for this binary task, as seen in Fig.
2.
The classification report in Table 3 summarizes the performance of the binary classifier. The model’s precision, recall, and F1-score for class “0” (the majority class) are quite high, with values above 95%, indicating strong predictive power for this class. For class “1” (the minority class), the precision is 90.8%, meaning most predictions are correct, though the recall is lower at 84.3%, suggesting that some instances of this class are missed. The F1-score for class “1” is 87.4%, balancing precision and recall. The overall accuracy of the model is 94.2%, with the macro and weighted averages reflecting strong but slightly imbalanced performance across the two classes.
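As a quick consistency check, the reported F1-score for class “1” follows from the stated precision and recall via the usual harmonic mean: \[ F_1 = \frac{2 \times 0.908 \times 0.843}{0.908 + 0.843} \approx 0.874, \] i.e., roughly 87.4%.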
4.3 Task B
Task B presents a more complex scenario with four distinct classes, as seen in Fig.
3. Here, the model’s performance varies more noticeably. It handles two of the classes quite well, particularly class 0 and class 1, but shows decreased accuracy for the other two. This suggests that some classes are more challenging for the model to distinguish, possibly due to overlapping features or subtle differences between categories.
The classification report in Table 4 for the multi-class classification model shows varying performance across the four classes. Class “1” has the best performance, with a precision of 64.8%, a recall of 72.8%, and an F1-score of 68.6%, indicating that the model does well in identifying this class. The other classes perform worse, with Class “0” having a precision of 63.4% but a lower recall of 56.5%, reflecting that many instances of this class are missed. Class “2” and Class “3” also show lower precision, recall, and F1-scores, indicating difficulties in correctly predicting these classes. The overall accuracy is 61%, with the macro average (averaged equally across classes) showing balanced but moderate performance. The weighted average is slightly higher, reflecting the model’s better performance on the more frequent Class “1”. Overall, the model shows moderate success, with room for improvement in identifying certain classes.
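For reference, the macro average treats every class equally, whereas the weighted average scales each class by its share of the test set: \[ \text{macro-}F_1 = \frac{1}{C}\sum_{c=1}^{C} F_{1,c}, \qquad \text{weighted-}F_1 = \sum_{c=1}^{C} \frac{n_c}{N}\, F_{1,c}, \] where \(C\) is the number of classes, \(n_c\) is the number of test instances of class \(c\), and \(N\) is the total number of test instances.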
4.4 Task C
The most intricate classification task is represented in Task C, where the model grapples with eleven different classes. The performance here is mixed, with the model excelling at identifying certain classes while struggling with others. Some classes, like 0, 1, 7, 9, and 10, are recognized with high accuracy, indicating clear distinguishing features for these categories. However, the model faces challenges with other classes, particularly 4, 5, and 8, where accuracy drops significantly. This pattern suggests that some classes have more distinct characteristics, making them easier for the model to identify, while others may share similarities that lead to misclassifications.
The fine-grained vector classification report in Table 5 shows varying performance across the 11 classes. Class “1” achieves the best balance of precision, recall, and F1-score, indicating it is relatively well predicted, though there is still room for improvement. Classes “2,” “3,” and “5” perform moderately, with fairly balanced precision and recall, though prediction errors are evident. Some classes, such as “8” and “4,” show significantly poor performance, with class “8” being completely misclassified. The overall accuracy is moderate, and the macro average, which treats each class equally, reveals underperformance in several classes. The weighted average reflects slightly better performance, as it takes the class distribution into account, but overall the model struggles with certain less frequent classes.
4.5 Gender Bias Analysis
We prepared a sample dataset of 20 sexist and 69 non-sexist sentences. In the sexist sentences, only the word “women” was replaced with “men”, for example, “women aren’t good drivers” versus “men aren’t good drivers”. The 69 non-sexist sentences created a class imbalance that made prediction more difficult. Since the RoBERTa model performed best at binary classification, we put it to the test. All 20 sentences targeting women were classified as sexist, and even some non-sexist sentences were also flagged as sexist. An F1-score of 85% reflects how well the model predicted sexist comments directed at women. When classifying the same sentences with “men” as the subject, the model tagged only one sentence as sexist. A poor F1-score of only 8% shows that the model struggles to detect sexist comments directed at men. This dataset can be accessed
here. The results of using LIME to explain the predictions of the RoBERTa model on potentially sexist comments are visualized in Fig.
5a and
5b.
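A minimal sketch of this gender-swap probe is given below; the file name, column names, fine-tuned checkpoint path, and label mapping are hypothetical assumptions rather than the exact artifacts used in this study.

```python
# Minimal sketch of the gender-swap probe: score the same probe sentences with
# "women" and with "men" as the subject. Paths, column names, and the
# LABEL_0/LABEL_1 mapping are hypothetical assumptions.
import pandas as pd
from sklearn.metrics import f1_score
from transformers import pipeline

clf = pipeline("text-classification", model="./roberta-task-a", truncation=True)

df = pd.read_csv("bias_probe.csv")  # hypothetical file with "text" and "label" columns
swapped = df["text"].str.replace(r"\bwomen\b", "men", regex=True)

def predict(texts):
    # Map pipeline outputs such as "LABEL_1" to integer class ids (1 = sexist).
    return [int(r["label"].split("_")[-1]) for r in clf(list(texts))]

print("F1 (women as subject):", f1_score(df["label"], predict(df["text"])))
print("F1 (men as subject):  ", f1_score(df["label"], predict(swapped)))
```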
In Fig.
5a, the model classified the sentence as “sexist” with a high probability of 0.98 (98%) and as non-sexist with a probability of only 0.02 (2%). The word “women” is highlighted in orange with the highest contribution (0.40), followed by “are” and “douchbags”, which contribute slightly less. The word “always” is highlighted in blue, contributing to the non-sexist prediction but only minimally. The word “women” is a strong indicator of sexism for the model, suggesting that when the subject of a comment is female, the model is more likely to detect sexism, revealing a potential gender bias. In Fig.
5b, the model classified this sentence as non-sexist with a probability of 1.00 (100%) and sexist with a probability of 0.00 (0%). The words “always” and “douchbags” are highlighted in orange, showing that these words had some importance in influencing the model’s decision. Despite this, the model still predicted the comment as non-sexist. Even though the sentence is derogatory towards men, the model does not flag it as sexist, suggesting potential bias where comments aimed at men might be downplayed.
The comparison between the two scenarios shows a clear imbalance in how the RoBERTa model interprets potentially sexist language based on the gender being targeted. Comments directed toward women are flagged as sexist, while similar comments about men are not, indicating gender bias in the model’s predictions. This analysis, using LIME, supports the hypothesis that the model is more sensitive to sexism involving women than men, highlighting the need for further refinement to address this bias.
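For completeness, explanations like those in Fig. 5 can be generated with the LIME text explainer roughly as follows; the local checkpoint path is an assumption and the example comment merely paraphrases Fig. 5a.

```python
# Minimal sketch of producing a LIME explanation for a single comment. The
# checkpoint path is an assumption; the example text paraphrases Fig. 5a.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./roberta-task-a")
model = AutoModelForSequenceClassification.from_pretrained("./roberta-task-a")
model.eval()

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities.
    enc = tokenizer(list(texts), truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["not sexist", "sexist"])
exp = explainer.explain_instance("women are always douchbags",
                                 predict_proba, num_features=6)
print(exp.as_list())  # per-word weights; positive values push towards "sexist"
```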