Explainable AI Discloses Gender Bias in Sexism Detection Algorithm

Published: 03 January 2025

Abstract

Online platforms have become breeding grounds for hate speech, with sexist comments posing a serious societal challenge. This paper employs a multi-category classification approach to detect and categorise sexist remarks. We experimented with several transformer-based models, including BERT, DistilBERT, SqueezeBERT, and RoBERTa. While these models excelled in binary classification with state-of-the-art F1 scores, our primary focus was to uncover potential gender biases within their predictions. To test our claims, we built and examined a custom gender bias dataset in which the same sexist sentences were gender-swapped. The LIME explainable AI technique was then used to identify the word features the model relied on most, revealing the model's inability to detect sexism in the same sentence when it targets the other gender (in this case, men). This research emphasizes the critical need for integrating explainable and unbiased sexism detectors into social media platforms to ensure fair, transparent content moderation and help mitigate gender bias in automated systems.

1 Introduction

Sexism means showing hatred, bigotry, or stereotyping towards someone based on their sex [1]. In an age where online activities such as browsing and social media take up much of our time, the presence of sexism online has become a persistent issue. With more social platforms allowing anonymous and unverified accounts, sexist comments have reached an all-time high. Various platforms have tried to eradicate this issue by deploying content filters and blockers [26]. However, these systems performed poorly, and in several cases regular texts were misclassified as sexist. Detecting sexist texts on social media is challenging due to inconsistent language and mixed sarcasm. While sexist comments may be considered hate speech, sexist sentences have a broader scope, and research on sexism detection is extensive [17]. In the fight against online sexism, identifying and categorizing such content into relevant sub-types is essential for moderation efforts. Sexism can manifest in various forms, such as derogatory language, stereotyping, or objectification, and these different types of sexist comments may require distinct responses [9]. Therefore, developing a system that detects sexist content and classifies it into multiple categories allows for a more nuanced understanding and response. Multiclass classification has become increasingly feasible with the rise of transformer-based models, such as BERT, RoBERTa, and GPT, which capture context and subtleties in language better than traditional machine learning algorithms.
While transformer-based models have shown state-of-the-art performance on various natural language processing tasks, including text classification, their black-box nature remains a critical limitation [3]. In particular, users and moderators need more than just accurate predictions—they require insight into why certain comments are classified as sexist. Explainable AI (XAI) tools, such as LIME (Local Interpretable Model-agnostic Explanations), have been introduced to fill this gap by highlighting the specific words or phrases that influenced the model’s decision. This interpretability is not only crucial for trust in automated systems but also for understanding and addressing the deeper societal and cultural biases that are embedded in online discourse. On the other hand, studies show that AI systems, trained on large datasets, can inherit biases present in the data, leading to biased classifications [23]. By utilizing LIME, this study also seeks to uncover potential gender bias in transformer-based models. Specifically, it aims to understand whether similar sexist comments directed at men and women are treated differently by the model, revealing underlying discriminatory tendencies.
In this paper, we present a comprehensive approach to automatic detection and multi-class classification of online sexist comments using transformer-based large language models. The contribution of this research includes:
(1) We perform binary, multi-class, and fine-grained vector classification of sexist comments using different transformer-based large language models.
(2) We analyze the models' predictions using LIME explainable AI to reveal which parts of a sentence the model treats as sexist.
(3) We expose potential gender discrimination in the classification by applying the framework to an independent gender-swapped dataset.

2 Related Work

This section gives an overview of recent progress in natural language processing for sexism detection. There have been different approaches to identifying racist slurs, derogatory comments, and hate speech in texts. Earlier work used rule-based methods that matched patterns of abusive and non-abusive words [11]. Due to hardware limitations, these approaches were superior to machine learning algorithms at the time. Even so, a 2012 study showed exceptional results using simple linear regression methods to classify Twitter texts [31]. With the gradual increase in computational power, more complex algorithms were introduced. Decision trees [6], logistic regression [30], support vector machines [8], and many other statistical machine learning approaches performed well over the years. These models mainly used a bag-of-words method under the hood [6, 30]. Later, n-gram feature engineering was introduced for text classification tasks and was used alongside the TF-IDF method [10].
Deep learning methods then surged once powerful GPUs became widely available. Recurrent neural networks (RNN) [27], Long Short-Term Memory (LSTM) [29], and bi-directional LSTMs [33] were the pioneering deep learning architectures in text classification. Their ability to model and remember sequences was groundbreaking across text-understanding tasks. Everything changed with the introduction of transformers and the attention mechanism in 2017 [28]. This enabled pre-training and transfer learning and made models flexible and modular. BERT was one of the first large language models to harness the power of transformers [7]. Since then, almost all natural language processing work has some level of transformer influence. Almost all submissions to SemEval-2023 Task 10, which focused on the classification of online sexism, used some type of BERT or other transformer model. These teams employed various pre-trained models, data augmentation techniques, and explainability methods, achieving varying degrees of success across Tasks A (binary sexism detection), B (category classification), and C (fine-grained classification).
Additionally, Nakwijit et al. leveraged BERT, BERTweet, and TwHIN-BERT, forming a sexism lexicon using SHAP values [18]. This method incorporated a bag-of-words approach for enhanced interpretability. Their performance was moderate, with F1 scores of 0.578 (Task B) and 0.262 (Task C), showcasing a trade-off between explainability and accuracy. Segura-Bedmar used a diverse set of transformers, including BERT, DistilBERT, RoBERTa, and XLNet [25]; the data augmentation they explored led to performance degradation. RoBERTa performed best on Task A with an F1 score of 0.832, while BERT achieved 0.599 on Task B and XLNet 0.417 on Task C. Chang et al. implemented DeBERTa-v3-large, XLM-RoBERTa-large, and covidtwitter-BERT-v2 [4], achieving strong results in Task A (F1 score of 0.8433 with DeBERTa), but participated only in that task. They pre-trained their models on the unlabeled GAB dataset. Hemati et al. introduced a random adversarial training layer, utilizing RoBERTa-Large and DeRoBERTa [13]. Their model achieved 0.845 on Task A, 0.678 on Task B, and 0.525 on Task C, indicating robust overall performance without explainability. Pu and Zhou implemented an ensemble of RoBERTa-BiLSTM-BiGRU and SimCSE-RoBERTa, excelling in Task A with a high F1 score of 0.85 [22].
Moreover, Guo et al. combined BERT and BiLSTM, attaining a modest 0.789 on Task A and enhancing the model with the EXIST-22 and TRAC-22 datasets for data augmentation [12]. Obeidat et al. achieved competitive scores with an ensemble of BERT and RoBERTa [19], performing well in all tasks: 0.8538 in Task A, 0.6417 in Task B, and 0.4774 in Task C. Das et al. fine-tuned RoBERTa [5], achieving 0.8364 on Task A, 0.6588 on Task B, and 0.3320 on Task C. Overall, ensemble models and transformer architectures such as RoBERTa and DeBERTa led to superior performance, especially in Task A. While SHAP-based explainability provided insights into the models' decision-making, it often came at the cost of reduced performance, as seen in the Nakwijit et al. method. Moreover, data augmentation techniques sometimes degraded the models' results, as evidenced by Segura-Bedmar's experiments.
To our knowledge, no prior work has used explainable AI to expose gender bias in a sexist-comment classifier in a similar setting.

3 Methodology

The proposed architecture in Figure 1 outlines a machine-learning pipeline for text classification using the family of BERT models. The pipeline starts with the EDOS dataset, which is divided into three tasks: Task A, Task B, and Task C. After task division, the data undergoes label encoding and train-test splitting. Label encoding converts categorical labels into numerical form, and the train-test split divides the data into separate sets for training and evaluation. The data is then fed to the Simple Transformers framework, where the tokenizer processes the input text, breaking words into tokens and substituting these tokens with their corresponding IDs. The tokenized input is fed into the large language model. After processing by the model, the output goes through the prediction and metrics analysis steps. The final step uses the LIME text explainer. LIME helps interpret the model's decisions by explaining which parts of the input text contributed most to the classification; in our case, it is used to understand gender bias.
This architecture combines a powerful pre-trained language model with explainable AI techniques (LIME) to create a comprehensive system for text classification that predicts and provides insights into its decision-making process. The multiple tasks (A, B, C) suggest a multi-faceted approach to classification, potentially addressing different aspects or levels of the problem at hand.
Figure 1: Workflow of the experiment
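To make the pipeline concrete, the sketch below wires these steps together. It is a minimal illustration rather than the exact training script: the file name, column names, and hyperparameters are assumptions, while the Simple Transformers calls (ClassificationModel, train_model, eval_model, predict) follow that library's documented interface.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Load the labelled EDOS data; "text" and "label_sexist" are assumed column names.
df = pd.read_csv("edos_labelled.csv")

# Label encoding: map string labels to integer ids expected by the classifier.
encoder = LabelEncoder()
df["labels"] = encoder.fit_transform(df["label_sexist"])

# Train-test splitting.
train_df, eval_df = train_test_split(df[["text", "labels"]],
                                     test_size=0.2, random_state=42)

# Simple Transformers preloads the matching tokenizer and vocabulary for the model.
args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=True)
model = ClassificationModel("roberta", "roberta-base",
                            num_labels=df["labels"].nunique(), args=args)

model.train_model(train_df)                                      # fine-tuning
result, model_outputs, wrong_preds = model.eval_model(eval_df)   # metrics analysis
predictions, raw_outputs = model.predict(["example post to classify"])
```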

3.1 Dataset

This dataset, called Explainable Online Sexism Detection (EDOS), is obtained from [15]. This task was intended to develop English-language models for sexism detection that are more accurate and explainable, with fine-grained classifications for sexist content from various sources. The task is divided into three hierarchical sub-tasks:
TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not.
TASK B - Category of Sexism: for sexist posts, a four-class classification where systems have to predict one of four categories: (1) threats, (2) derogation, (3) animosity, (4) prejudiced discussion.
TASK C - Fine-grained Vector of Sexism: for sexist posts, an 11-class classification where systems have to predict one of 11 fine-grained vectors.
The distribution of the dataset is hugely imbalanced, as shown in Table 1. This made the dataset ideal for Task A; however, we attempted all three tasks as the challenge intended.
Class | Category | Fine-grained vector | Count
Sexist | Threats | Incitement and encouragement of harm | 254
Sexist | Threats | Threats of harm | 56
Sexist | Derogation | Descriptive attacks | 717
Sexist | Derogation | Aggressive and emotive attacks | 673
Sexist | Derogation | Dehumanizing attacks and overt sexual objectification | 200
Sexist | Animosity | Casual use of gendered slurs, profanities, and insults | 637
Sexist | Animosity | Immutable gender differences and gender stereotypes | 417
Sexist | Animosity | Backhanded gendered compliments | 64
Sexist | Animosity | Condescending explanations or unwelcome advice | 47
Sexist | Prejudiced Discussion | Supporting systemic discrimination against women as a group | 258
Sexist | Prejudiced Discussion | Supporting mistreatment of individual women | 75
Non-Sexist | - | - | 10602
Table 1: Dataset distribution among different classes
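The hierarchical labels above translate into three training sets. A hedged sketch of that split is shown below; the column names (label_sexist, label_category, label_vector) are assumptions about how the EDOS CSV is laid out.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("edos_labelled.csv")

# Task A: binary sexist / not-sexist over all posts.
task_a = df[["text"]].copy()
task_a["labels"] = (df["label_sexist"] == "sexist").astype(int)

# Tasks B and C use only the posts labelled sexist.
sexist = df[df["label_sexist"] == "sexist"]

task_b = sexist[["text"]].copy()
task_b["labels"] = LabelEncoder().fit_transform(sexist["label_category"])  # 4 categories

task_c = sexist[["text"]].copy()
task_c["labels"] = LabelEncoder().fit_transform(sexist["label_vector"])    # 11 fine-grained vectors
```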

3.2 Classifier Models

The classification task was made easier with the help of the Simple Transformers framework, which made it easy to experiment with different models and compare their performance. A benefit of Simple Transformers is that the matching tokenizer and vocabulary are preloaded when the model is instantiated. We put the base BERT and three different BERT-derived models to the test here.
BERT: BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer architecture to train deep bidirectional representations by conditioning on both left and right context across all layers [7]. BERT differs from typical models in that it employs a masked language modelling (MLM) objective, which masks some words in the input sequence and trains the model to predict them. It also includes the next sentence prediction (NSP) task, which helps the model learn how sentences relate to one another.
RoBERTa: RoBERTa is based on the BERT architecture [16]. It modifies several essential hyperparameters, removing the next-sentence pretraining objective and training with larger mini-batches and learning rates. It uses the Transformer as its core structure for representation. We conducted our investigation using the RoBERTa-base model. Similar to other transformer-based models, RoBERTa relies on a transformer architecture to process and understand textual input, capturing both local and long-range dependencies in the text.
DistilBERT: DistilBERT is a smaller and faster variant of BERT produced by a method called knowledge distillation [24]. It preserves many of BERT's language comprehension features while being more computationally efficient. DistilBERT eliminates the next sentence prediction (NSP) task and decreases the number of layers to 6, compared to BERT's 12, while maintaining the same number of hidden units. Despite its smaller size, DistilBERT retains the majority of BERT's language comprehension, making it a lightweight yet effective alternative.
SqueezeBERT: SqueezeBERT, on the other hand, is a more recent model designed to reduce the computational and memory overhead of BERT-like models by replacing many of the fully connected layers in the transformer blocks with grouped convolutions [14]. SqueezeBERT-uncased has fewer layers, fewer parameters, and uses less memory, making it suitable for deployment in resource-constrained environments, such as mobile devices, without significant loss in performance. Like BERT and DistilBERT, it is an uncased model, where case distinctions (upper vs. lower) are ignored during training and inference.
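Under Simple Transformers, swapping between these backbones amounts to changing a (model type, checkpoint) pair. The sketch below illustrates the comparison loop; the checkpoint names are the standard Hugging Face identifiers, and SqueezeBERT support depends on the installed Simple Transformers version, so treat the exact strings as assumptions.

```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs

candidates = [
    ("bert", "bert-base-uncased"),
    ("roberta", "roberta-base"),
    ("distilbert", "distilbert-base-uncased"),
    ("squeezebert", "squeezebert/squeezebert-uncased"),
]

args = ClassificationArgs(num_train_epochs=3, train_batch_size=16,
                          overwrite_output_dir=True)

for model_type, checkpoint in candidates:
    # train_df / eval_df come from the pipeline sketch in Section 3.
    model = ClassificationModel(model_type, checkpoint, num_labels=2, args=args)
    model.train_model(train_df)
    result, _, _ = model.eval_model(eval_df)
    print(model_type, result)
```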

4 Experimental Evaluation

We conducted the experiments on a laptop equipped with an NVIDIA RTX 3060 GPU and an AMD Ryzen 7 5700X CPU, with 16 GB of RAM and 6 GB of video memory. PyTorch was the deep-learning backbone [21]. We used accuracy, precision, recall, and F1-score to evaluate the classification performance of the different models. The confusion matrix of the test set was also examined to get an overview of the type I and type II errors made during classification.
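For reference, the metrics and the confusion matrix can be computed with standard scikit-learn utilities; a minimal sketch, assuming the fine-tuned model and eval_df from the earlier sketches, is shown below.

```python
from sklearn.metrics import classification_report, confusion_matrix

preds, _ = model.predict(eval_df["text"].tolist())
y_true = eval_df["labels"].tolist()

print(classification_report(y_true, preds, digits=3))  # accuracy, precision, recall, F1 (per class and macro)
print(confusion_matrix(y_true, preds))                 # overview of type I and type II errors
```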

4.1 Classification Results

The study reveals that as the number of classes increases, the model's overall accuracy tends to decrease, a common challenge in extreme multi-class classification problems. The varying performance across different classes in Tasks B and C indicates imbalances in the training data. Table 2 summarizes performance metrics for the three tasks across four models: BERT, RoBERTa, DistilBERT, and SqueezeBERT. In Task A, RoBERTa outperforms the other models with the highest accuracy, precision, recall, and F1 score. BERT follows closely with an accuracy of 86.8%. DistilBERT and SqueezeBERT perform similarly, with slightly lower scores around 85%. Task B's performance is lower than Task A's, with RoBERTa leading at 60.98% accuracy and an F1-score of 57.56%. DistilBERT and SqueezeBERT show reduced performance, with SqueezeBERT scoring the lowest. Lastly, all models struggle more with Task C. RoBERTa maintains the highest performance with an accuracy of 52.75% and an F1-score of 49%. BERT follows with 51.37% accuracy and a lower F1-score of 45.3%. DistilBERT and SqueezeBERT perform significantly worse, with SqueezeBERT showing the lowest accuracy (46.08%) and a very poor F1-score (35.78%).
Task | Model | Acc | Prec | Rec | F1
A | BERT | 86.8 | 82.8 | 81.1 | 81.9
A | RoBERTa | 93.64 | 92.64 | 90.11 | 91.28
A | DistilBERT | 85.4 | 81 | 77.2 | 78.8
A | SqueezeBERT | 85.2 | 80.5 | 77.8 | 79
B | BERT | 60.0 | 61.3 | 57 | 58.8
B | RoBERTa | 61 | 59.35 | 56.23 | 57.56
B | DistilBERT | 59.0 | 58.5 | 54.7 | 56.3
B | SqueezeBERT | 55.9 | 52.8 | 51.5 | 52.1
C | BERT | 51.37 | 49.96 | 43.34 | 45.3
C | RoBERTa | 52.75 | 53 | 47 | 49
C | DistilBERT | 50.98 | 42.65 | 41.07 | 41.54
C | SqueezeBERT | 46.08 | 37.09 | 35.38 | 35.78
Table 2: Performance metrics (macro average) obtained in different tasks and models.
Overall, RoBERTa consistently outperforms the other models across all tasks, while DistilBERT and SqueezeBERT offer more lightweight but less accurate alternatives, particularly struggling with the more complex multiclass tasks. For each task, we show the outputs from the RoBERTa model.

4.2 Task A

In Task A, we see a binary classification scenario, distinguishing between sexist and non-sexist comments. The model demonstrates strong performance, accurately identifying both categories with high precision. There’s a slight edge in recognizing non-sexist content, but overall, the model shows balanced and effective classification for this binary task, as seen in Fig. 2.
Class | Precision | Recall | F1-score | Support
0 | 95.2 | 97.3 | 96.2 | 2132
1 | 90.8 | 84.3 | 87.4 | 668
Accuracy | | | 94.2 | 2800
Macro avg | 93 | 90.8 | 91.8 | 2800
Weighted avg | 94.1 | 94.2 | 94.1 | 2800
Table 3: Classification report for Task A
The classification report in Table 3 summarizes the performance of the binary classifier. The model's precision, recall, and F1-score for class "0" (the majority class) are quite high, with values above 95%, indicating strong predictive power for this class. For class "1" (the minority class), the precision is 90.8%, meaning most predictions are correct, though recall is lower at 84.3%, suggesting some instances of this class are missed. The F1-score for class "1" is 87.4%, balancing precision and recall. The overall accuracy of the model is 94.2%, with the macro and weighted averages reflecting strong but slightly imbalanced performance across the two classes.
Figure 2: Confusion matrix of binary classification

4.3 Task B

Task B presents a more complex scenario with four distinct classes, as seen in Fig. 3. Here, the model’s performance varies more noticeably. It handles two of the classes quite well, particularly class 0 and class 1, but shows decreased accuracy for the other two. This suggests that some classes are more challenging for the model to distinguish, possibly due to overlapping features or subtle differences between categories.
Class | Precision | Recall | F1-score | Support
0 | 63.4 | 56.5 | 59.8 | 46
1 | 64.8 | 72.8 | 68.6 | 243
2 | 55.5 | 49.7 | 52.4 | 173
3 | 53.7 | 45.8 | 49.4 | 48
Accuracy | | | 61 | 510
Macro avg | 59.4 | 56.2 | 57.6 | 510
Weighted avg | 60.5 | 61 | 60.5 | 510
Table 4: Classification report for Task B
The classification report in Table 4 for the multi-class model shows varying performance across the four classes. Class "1" has the best performance with a precision of 64.8%, a recall of 72.8%, and an F1-score of 68.6%, indicating the model does well in identifying this class. The other classes perform worse, with class "0" having a precision of 63.4% but lower recall at 56.5%, reflecting that many instances of this class are missed. Classes "2" and "3" also show lower precision, recall, and F1 scores, indicating difficulties in correctly predicting these classes. The overall accuracy is 61%, with the macro average (averaged equally across classes) showing balanced but moderate performance. The weighted average is slightly higher, reflecting the model's better performance on the more frequent class "1". Overall, the model shows moderate success, with room for improvement in identifying certain classes.
Figure 3: Confusion matrix of multi-category classification

4.4 Task C

The most intricate classification task is represented in Task C, where the model grapples with eleven different classes. The performance here is mixed, with the model excelling at identifying certain classes while struggling with others. Some classes, like 0, 1, 7, 9, and 10, are recognized with high accuracy, indicating clear distinguishing features for these categories. However, the model faces challenges with other classes, particularly 4, 5, and 8, where accuracy drops significantly. This pattern suggests that some classes have more distinct characteristics, making them easier for the model to identify, while others may share similarities that lead to misclassifications.
Class | Precision | Recall | F1-score | Support
0 | 70 | 70 | 70 | 10
1 | 71.4 | 64.5 | 67.8 | 31
2 | 50.8 | 56.5 | 53.5 | 115
3 | 50.7 | 61.6 | 55.6 | 112
4 | 29 | 32.1 | 30.5 | 28
5 | 63 | 56 | 59.3 | 91
6 | 41.2 | 35 | 37.8 | 60
7 | 75 | 42.9 | 54.6 | 7
8 | 0 | 0 | 0 | 6
9 | 70 | 58.3 | 63.6 | 12
10 | 59.3 | 42.1 | 49.2 | 38
Accuracy | | | 52.6 | 510
Macro avg | 52.8 | 47.2 | 49.3 | 510
Weighted avg | 53.1 | 52.6 | 52.4 | 510
Table 5: Classification report for Task C
In the vector classification report in Table 5, the model shows varying performance across the 11 classes. Class "1" achieves the best balance of precision, recall, and F1-score, indicating it is relatively well-predicted, though there is still room for improvement. Classes "2", "3", and "5" perform moderately, with fairly balanced precision and recall, though prediction errors are evident. Some classes, such as "8" and "4", show significantly poor performance, with "8" being completely misclassified. The overall accuracy is moderate, and the macro average, which treats each class equally, suggests underperformance in several classes. The weighted average reflects slightly better performance, as it takes the class distribution into account, but overall the model struggles with certain less frequent classes.
Figure 4: Confusion matrix of vector classification

4.5 Gender Bias Analysis

We prepared a sample dataset of 20 sexist and 69 non-sexist sentences. In the sexist sentences, only the word "women" was replaced with "men"; for example, "women aren't good drivers" and "men aren't good drivers". The 69 non-sexist sentences created an imbalance and made prediction more difficult. Since the RoBERTa model is very good at binary classification, we put it to the test. All 20 sentences about women were classified as sexist, and even some non-sexist ones were flagged as sexist. The F1 score of 85% shows how well the model predicted sexist comments against women. When classifying the same sentences about men, it tagged only one sentence as sexist. A poor F1 score of only 8% shows the model's struggle to predict any sexist comments against men. This dataset can be accessed here. The results of using LIME to explain the predictions of the RoBERTa model on potentially sexist comments are visualized in Fig. 5a and 5b.
Target | Class | Precision | Recall | F1 | Support
Women | 0 | 100 | 89.86 | 94.66 | 69
Women | 1 | 74.07 | 100 | 85.11 | 20
Men | 0 | 77.38 | 94.2 | 84.97 | 69
Men | 1 | 20 | 5 | 8 | 20
Table 6: Performance on potential sexist comments against men and women
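A minimal sketch of this gender-swap probe is shown below. It is illustrative only: the probe sentences here are examples rather than the actual 89-sentence dataset, and model is assumed to be the Task A RoBERTa classifier fine-tuned earlier.

```python
from sklearn.metrics import f1_score

# Same sexist sentences, once targeting women and once gender-swapped to men.
probe_women = ["women aren't good drivers", "women are always douchbags"]
probe_men = [s.replace("women", "men") for s in probe_women]
y_true = [1] * len(probe_women)   # all sexist by construction (1 = sexist)

pred_women, _ = model.predict(probe_women)
pred_men, _ = model.predict(probe_men)

print("F1 (targets are women):", f1_score(y_true, pred_women))
print("F1 (targets are men):  ", f1_score(y_true, pred_men))
```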
Figure 5: LIME model output
In Fig. 5a, the model classified the sentence as sexist with a high probability of 0.98 (98%) and non-sexist with a probability of only 0.02 (2%). The word "women" is highlighted in orange with the highest contribution (0.40), followed by "are" and "douchbags", which contribute slightly less. The word "always" is highlighted in blue, contributing to the non-sexist prediction but only minimally. Here, the model classifies the comment as highly sexist. The word "women" is a strong indicator of sexism for the model, suggesting that when the subject of a comment is female, the model is more likely to detect sexism, revealing a potential gender bias. In Fig. 5b, the model classified the gender-swapped sentence as non-sexist with a probability of 1.00 (100%) and sexist with a probability of 0.00 (0%). The words "always" and "douchbags" are highlighted in orange, showing that these words had some influence on the model's decision. Despite this, the model still predicted the comment as non-sexist. Even though the sentence is derogatory towards men, the model does not flag it as sexist, suggesting a potential bias whereby comments aimed at men are downplayed.
The comparison between the two scenarios shows a clear imbalance in how the RoBERTa model interprets potentially sexist language based on the gender being targeted. Comments directed toward women are flagged as sexist, while similar comments about men are not, indicating gender bias in the model’s predictions. This analysis, using LIME, supports the hypothesis that the model is more sensitive to sexism involving women than men, highlighting the need for further refinement to address this bias.
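The LIME explanations in Fig. 5 were produced with the LimeTextExplainer from the lime package. The sketch below shows one way to wire it to the classifier; the softmax wrapper around the Simple Transformers raw outputs is an assumption about how to expose class probabilities to LIME.

```python
import numpy as np
from scipy.special import softmax
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities.
    _, raw_outputs = model.predict(list(texts))
    return softmax(np.array(raw_outputs), axis=1)

explainer = LimeTextExplainer(class_names=["non-sexist", "sexist"])
explanation = explainer.explain_instance("women are always douchbags",
                                         predict_proba, num_features=6)
print(explanation.as_list())   # per-word contributions, as visualized in Fig. 5
```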
Paper | Method | F1 Score | Explainer | Remarks
Nakwijit et al. [18] | BERT, BERTweet, TwHIN-BERT | Task B: 0.578; Task C: 0.262 | SHAP | Formed a lexicon based on SHAP values and combined it with a bag-of-words method
Segura-Bedmar [25] | BERT, DistilBERT, RoBERTa, XLNet | Task A: 0.832; Task B: 0.599; Task C: 0.417 | No | Data augmentation consistently degraded performance
Chang et al. [4] | DeBERTa-v3-large, XLM-RoBERTa-large, covidtwitter-BERT-v2 | Task A only: 0.8433 (DeBERTa), 0.8251 (XLM), 0.8316 (Covid) | No | Pre-trained the models on the unlabeled GAB dataset
Hemati et al. [13] | RoBERTa-Large, DeRoBERTa | Task A: 0.845; Task B: 0.678; Task C: 0.525 | No | Implemented a random adversarial training layer
Pu and Zhou [22] | Ensemble of RoBERTa-BiLSTM-BiGRU and SimCSE-RoBERTa | Task A: 0.85 | No |
Guo et al. [12] | BERT-BiLSTM | Task A: 0.789 | No | Augmentation using the EXIST-22 and TRAC-22 datasets
Obeidat et al. [19] | BERT, RoBERTa | Task A: 0.8538; Task B: 0.6417; Task C: 0.4774 | No | Ensemble of BERT and RoBERTa performed better in Task B
Das et al. [5] | RoBERTa-Large | Task A: 0.8364; Task B: 0.6588; Task C: 0.3320 | No |
Proposed Methodology | RoBERTa-Base with Simple Transformers | Task A: 0.9128; Task B: 0.5756; Task C: 0.493 | LIME | Uncovers gender bias in classification while achieving state-of-the-art Task A performance
Table 7: Comparison of different approaches on the EDOS dataset in the SemEval-2023 challenge

5 Discussion

This section compares our performance with similar work and discusses the limitations of our approach.

5.1 Comparison With Related Works

Table 7 compares our approach with other systems from the SemEval-2023 competition. The RoBERTa-Base with Simple Transformers method outperforms other systems in detecting sexism, with an F1 score of 0.9128 in Task A. In Task B, it achieves an F1 score of 0.5756, comparable to other systems. In Task C, the most challenging, it achieves an F1 score of 0.493, competitive with top systems, even though many papers did not attempt Task C. Moreover, our method is a straightforward, efficient, and easy-to-implement framework, unlike methods such as PCJ's complex ensemble of RoBERTa with BiLSTM and BiGRU, which adds complexity and computational cost. SUTNLP's adversarial training method, while potentially powerful, also adds complexity, making our framework a more efficient option for training and inference. Our method, unlike most others, prioritizes explainability, a crucial aspect for detecting sensitive topics like sexism. Its ability to explain its decisions and identify gender bias in classification could enhance its reliability and acceptability in real-world applications. In summary, our proposed methodology excels because it offers:
State-of-the-art performance, especially in Task A.
Explainability through LIME, offering insight into the model's decision-making process and revealing the gender bias issue in the machine learning algorithm.
A simpler and more efficient architecture compared to complex ensemble and adversarial training models.
A balanced performance across all tasks without compromising explainability or adding unnecessary complexity.
This makes the approach more practical, interpretable, and ultimately more suitable for real-world applications, where transparency and efficiency are just as important as raw performance. Apart from classification accuracy, there have been active studies on gender bias analysis. Gender-swap data augmentation was discussed in [20, 32], though those works mostly target correcting gender-pair detection in a sentence. [2] proposed an algorithm that removes gender-stereotypical information from word embeddings, calling the result debiased word embeddings. None of these methods could explain the bias beyond the predicted outcomes. We introduce an explainable-AI-enabled approach that makes the results more intuitive and easier to act on.

5.2 Limitations

Though our study shines in various areas, it has some limitations. The class imbalances across Tasks B and C made it difficult to perform well. We did not explicitly modify any models beyond hyperparameter tuning. In addition, a class-weighted training approach should have been utilized for Tasks B and C, as sketched below.
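For illustration, one hedged way to add class weighting under Simple Transformers is sketched below; it assumes the ClassificationModel weight argument passes per-class weights to the underlying cross-entropy loss, and reuses the Task B or Task C training frame from Section 3.1.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from simpletransformers.classification import ClassificationModel

labels = train_df["labels"].to_numpy()          # Task B or Task C training labels
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)

# Heavier weights are assigned to underrepresented classes.
model = ClassificationModel("roberta", "roberta-base",
                            num_labels=len(weights),
                            weight=weights.tolist())
model.train_model(train_df)
```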

6 Conclusion and Future Work

In this study, the focus was on the ethical clarity of AI algorithms and their potential misuse in real-life applications. As shown, large language models can classify sexist and non-sexist comments, but they cannot do so gender-neutrally. This issue originates from the dataset being skewed towards women, with too few examples targeting men. The imbalanced data leads to biased models, as the model has not seen enough examples from underrepresented groups. Our LIME explainable AI approach has shown that the algorithm understands the tokens, but its decision is based on gender. To solve this issue, a human-in-the-XAI-loop can be formed. Our future work will leverage LIME feature importance for threshold adjustments after prediction. We will use class-weighted loss functions, particularly for underrepresented classes in Tasks B and C, to enhance model performance. The gender bias analysis will be expanded to include complex forms of sexism for both genders, providing a more comprehensive assessment of nuanced bias detection. Training strategies will be investigated to promote gender-neutral recognition of sexism, including de-biasing algorithms and gender-neutral training regimes to mitigate biases from data imbalance.

References

[1]
Rebecca S Bigler and Lynn S Liben. 2007. Developmental intergroup theory: Explaining and reducing children’s social stereotyping and prejudice. Current directions in psychological science 16, 3 (2007), 162–166.
[2]
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems 29 (2016).
[3]
Davide Castelvecchi. 2016. Can we open the black box of AI? Nature News 538, 7623 (2016), 20.
[4]
Yu Chang, Yuxi Chen, and Yanru Zhang. 2023. niceNLP at SemEval-2023 Task 10: Dual Model Alternate Pseudo-labeling Improves Your Predictions. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 307–311.
[5]
Amit Das, Nilanjana Raychawdhary, Tathagata Bhattacharya, Gerry Dozier, and Cheryl D Seals. 2023. AU_NLP at SemEval-2023 task 10: Explainable detection of online sexism using fine-tuned RoBERTa. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 707–717.
[6]
Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the international AAAI conference on web and social media, Vol. 11. 512–515.
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[8]
Shahnoor C Eshan and Mohammad S Hasan. 2017. An application of machine learning to detect abusive bengali text. In 2017 20th International conference of computer and information technology (ICCIT). IEEE, 1–6.
[9]
Fabio Fasoli, Andrea Carnaghi, and Maria Paola Paladino. 2015. Social acceptability of sexist derogatory and sexist objectifying slurs across contexts. Language sciences 52 (2015), 98–107.
[10]
Aditya Gaydhani, Vikrant Doma, Shrikant Kendre, and Laxmi Bhagwat. 2018. Detecting hate speech and offensive language on Twitter using machine learning: An n-gram and TF-IDF based approach. arXiv preprint arXiv:1809.08651 (2018).
[11]
Philip Gianfortoni, David Adamson, and Carolyn Rose. 2011. Modeling of stylistic variation in social media with stretchy patterns. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties. 49–59.
[12]
Kangshuai Guo, Ruipeng Ma, Shichao Luo, and Yan Wang. 2023. Coco at semeval-2023 task 10: Explainable detection of online sexism. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 469–476.
[13]
Hamed Hematian Hemati, Sayed Hesam Alavian, Hamid Beigy, and Hossein Sameti. 2023. SUTNLP at SemEval-2023 Task 10: RLAT-Transformer for explainable online sexism detection. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 347–356.
[14]
Forrest N Iandola, Albert E Shaw, Ravi Krishna, and Kurt W Keutzer. 2020. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv preprint arXiv:2006.11316 (2020).
[15]
Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, and Paul Röttger. 2023. SemEval-2023 Task 10: Explainable Detection of Online Sexism. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). Association for Computational Linguistics.
[16]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[17]
Ariadna Matamoros-Fernández. 2017. Platformed racism: The mediation and circulation of an Australian race-based controversy on Twitter, Facebook and YouTube. Information, Communication & Society 20, 6 (2017), 930–946.
[18]
Pakawat Nakwijit, Mahmoud Samir, and Matthew Purver. 2023. Lexicools at SemEval-2023 Task 10: Sexism Lexicon Construction via XAI. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 23–43.
[19]
Doaa Obeidat, Heba Nammas, Malak Abdullah, et al. 2023. JUST_ONE at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 526–531.
[20]
Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231 (2018).
[21]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[22]
Chujun Pu and Xiaobing Zhou. 2023. PCJ at SemEval-2023 Task 10: A Ensemble Model Based on Pre-trained Model for Sexism Detection and Classification in English. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 433–438.
[23]
Drew Roselli, Jeanna Matthews, and Nisha Talagala. 2019. Managing bias in AI. In Companion proceedings of the 2019 world wide web conference. 539–544.
[24]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[25]
Isabel Segura-Bedmar. 2023. HULAT at SemEval-2023 Task 10: Data augmentation for pre-trained transformers applied to the detection of sexism in social media. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 184–192.
[26]
Eugenia Siapera and Paloma Viejo-Otero. 2021. Governing hate: Facebook and digital racism. Television & New Media 22, 2 (2021), 112–130.
[27]
Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing. 1422–1432.
[28]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017).
[29]
Xin Wang, Yuanchao Liu, Cheng-Jie Sun, Baoxun Wang, and Xiaolong Wang. 2015. Predicting polarities of tweets by composing word embeddings with long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1343–1353.
[30]
Zeerak Waseem. 2016. Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter. In Proceedings of the first workshop on NLP and computational social science. 138–142.
[31]
Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. 2012. Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. In Proceedings of the 21st ACM international conference on Information and knowledge management. 1980–1984.
[32]
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876 (2018).
[33]
Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639 (2016).
