1 Introduction
Sexism refers to hatred, prejudice, or stereotyping directed at a person on the basis of their sex [1]. In an age where online activities such as browsing and social media occupy much of our time, the presence of sexism online has become a pressing issue. With the rise of social platforms that allow anonymous and unverified accounts, sexist comments have proliferated. Various platforms have tried to curb the problem with content filters and blockers [26]. However, these systems have performed poorly, and in several cases ordinary texts were misclassified as sexist. Detecting sexist texts on social media is challenging due to inconsistent wording and mixed sarcasm. While sexist comments may be considered a form of hate speech, sexism has a broader scope, and research on sexism detection is extensive [
17]. In the fight against online sexism, identifying and categorizing such content into relevant sub-types is essential for moderation efforts. Sexism can manifest in various forms, such as derogatory language, stereotyping, or objectification, and these different types of sexist comments may require distinct responses [
9]. Therefore, developing a system that detects sexist content and classifies it into multiple categories allows for a more nuanced understanding and response. Multiclass classification has become increasingly feasible with the rise of transformer-based models, such as BERT, RoBERTa, and GPT, which can capture context and subtleties in language better than traditional machine learning algorithms.
While transformer-based models have shown state-of-the-art performance on various natural language processing tasks, including text classification, their black-box nature remains a critical limitation [
3]. In particular, users and moderators need more than just accurate predictions; they require insight into why certain comments are classified as sexist. Explainable AI (XAI) tools, such as LIME (Local Interpretable Model-agnostic Explanations), have been introduced to fill this gap by highlighting the specific words or phrases that influenced the model’s decision. This interpretability is crucial not only for trust in automated systems but also for understanding and addressing the deeper societal and cultural biases embedded in online discourse. At the same time, studies show that AI systems trained on large datasets can inherit biases present in the data, leading to biased classifications [
23]. By utilizing LIME, this study also seeks to uncover potential gender bias in transformer-based models. Specifically, it aims to understand whether similar sexist comments directed at men and women are treated differently by the model, revealing underlying discriminatory tendencies.
In this paper, we present a comprehensive approach to automatic detection and multi-class classification of online sexist comments using transformer-based large language models. The contribution of this research includes:
(1)
We perform binary, multi-class, and fine-grained vector classification of sexist comments using different transformer-based large language models.
(2)
We analyze the model’s predictions using LIME explainable AI to identify which parts of a sentence the model considers sexist.
(3)
We expose potential gender discrimination in the classification by applying the framework to an independent test dataset.
2 Related Work
This section gives an overview of recent progress in natural language processing for sexism detection. Different approaches have been taken to identify racist slurs, derogatory comments, and hate speech in text. Earlier work relied on rule-based methods that classified texts according to patterns of abusive and non-abusive words [11]. Owing to hardware limitations, these approaches outperformed machine learning algorithms at the time. Even so, a 2012 study showed exceptional results using simple linear regression methods [31] to classify Twitter texts. With the gradual increase in computational power, more complex algorithms were introduced. Decision trees [
6], logistic regression [
30], support vector machines [
8], and many other statistical machine learning approaches performed well over the years. These models mainly used a bag-of-words representation under the hood [
6,
30]. Later, n-gram feature engineering was introduced for text classification tasks and was used alongside the TF-IDF method [
10].
In addition, deep learning methods later surged once powerful GPUs became widely available. Recurrent neural networks (RNNs) [
27], Long Short-Term Memory (LSTM) [
29], and Bi-directional LSTMs [
33] have been the pioneering deep learning architectures for text classification. Their ability to model and remember sequences was groundbreaking across text-understanding tasks. Everything changed with the introduction of transformers and the attention mechanism by [28] in 2017. This enabled pre-training and transfer learning and made models flexible and modular. BERT was one of the first large language models to harness the power of transformers [7]. Since then, nearly all recent natural language processing tasks have involved transformers to some degree. Almost all submissions to SemEval-2023 Task 10, which focuses on the classification of online sexism, used some variant of BERT or another transformer model. These teams employed various pre-trained models, data augmentation techniques, and explainability methods, achieving varying degrees of success across Tasks A (binary sexism detection), B (category classification), and C (fine-grained classification).
Additionally, Nakwijit et al. leveraged BERT, BERTweet, and TwHIN-BERT, forming a sexism lexicon using SHAP values [
18]. This method incorporated a bag-of-words approach for enhanced interpretability. Their performance was moderate, with F1 scores of 0.578 (Task B) and 0.262 (Task C), showcasing a trade-off between explainability and accuracy. Segura et al. used a diverse set of transformers, including BERT, DistilBERT, RoBERTa, and XLNet [25]. Although they explored data augmentation, it degraded performance. RoBERTa performed best on Task A with an F1 score of 0.832, while BERT achieved 0.599 on Task B, and XLNet 0.417 on Task C. Chang et al. implemented DeBERTa-v3-large, XLM-RoBERTa-large, and covidtwitter-BERT-v2 [
4], achieving strong results in Task A (an F1 score of 0.8433 with DeBERTa), though they participated only in this task. They pre-trained their models on the unlabeled GAB dataset. Hemati et al. introduced a random adversarial training layer, utilizing RoBERTa-Large and DeRoBERTa [
13]. Their model achieved 0.845 on Task A, 0.678 on Task B, and 0.525 on Task C, indicating robust overall performance without explainability. Pu and Zhou implemented an ensemble of RoBERTa-BiLSTM-BiGRU and SimCSE-RoBERTa, excelling in Task A with a high F1 score of 0.85 [
22].
Moreover, Guo et al. combined BERT and BiLSTM, attaining a modest 0.789 for Task A and enhancing the model with datasets from EXIST-22 and TRAC-22 for data augmentation [
12]. Obeidat et al. achieved competitive scores with an ensemble of BERT and RoBERTa [
19], performing well in all tasks: 0.8538 in Task A, 0.6417 in Task B, and 0.4774 in Task C. Das et al. fine-tuned RoBERTa [
5], achieving 0.8364 on Task A, 0.6588 on Task B, and 0.3320 on Task C. Overall, the use of ensemble models and transformer architectures such as RoBERTa and DeBERTa led to superior performance, especially in Task A. While SHAP-based explainability provided insights into the models’ decision-making, it often came at the cost of reduced performance, as seen in Nakwijit et al.’s method. Moreover, data augmentation techniques sometimes degraded the models’ results, as evidenced by Segura et al.’s experiments.
To our knowledge, no prior work has used explainable AI to examine gender bias in sexist-comment classifiers in a similar setting.
4 Experimental Evaluation
We conducted the experiments on a laptop equipped with an NVIDIA RTX 3060 GPU (6 GB of video memory), an AMD Ryzen 7 5700X CPU, and 16 GB of RAM. PyTorch was the deep-learning backbone [21]. We used accuracy, precision, recall, and F1-score to evaluate the classification performance of the different models. We also inspected the confusion matrix on the test set to get an overview of the type I and type II errors made during classification.
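As a minimal sketch (not the exact evaluation script used in this study), these metrics and the confusion matrix can be computed with scikit-learn; the labels below are toy placeholders standing in for the test labels and model predictions.

```python
# Minimal sketch of the evaluation metrics described above; y_true and y_pred
# are toy placeholders, not the actual test-set labels and predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred)  # off-diagonal cells are type I/II errors

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(cm)
```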
4.1 Classification Results
The study reveals that as the number of classes increases, the models’ overall accuracy tends to decrease, a common challenge in multi-class classification problems with many classes. The varying performance across the classes in Tasks B and C points to imbalances in the training data. Table 2 summarizes the performance metrics of the three tasks for four models: BERT, RoBERTa, DistilBERT, and SqueezeBERT. In Task A, RoBERTa outperforms the other models with the highest accuracy, precision, recall, and F1-score. BERT follows closely with an accuracy of 86.8%. DistilBERT and SqueezeBERT perform similarly, with slightly lower scores (around 85%). Task B’s performance is lower than Task A’s, with RoBERTa leading at 60.98% accuracy and an F1-score of 57.56%. DistilBERT and SqueezeBERT show reduced performance, with SqueezeBERT scoring the lowest. Lastly, all models struggle more with Task C. RoBERTa maintains the highest performance with an accuracy of 52.75% and an F1-score of 49%. BERT follows with 51.37% accuracy and a lower F1-score of 45.3%. DistilBERT and SqueezeBERT perform significantly worse, with SqueezeBERT showing the lowest accuracy (46.08%) and a very poor F1-score (35.78%).
Overall, RoBERTa consistently outperforms the other models across all tasks, while DistilBERT and SqueezeBERT offer more lightweight but less accurate alternatives, struggling in particular with the more complex multi-class tasks. In the following subsections, we present the outputs of the RoBERTa model for each task.
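For illustration, a minimal fine-tuning sketch for one such model is shown below; the checkpoint name, file name, and hyperparameters are assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch of fine-tuning RoBERTa for Task A (binary sexism detection).
# The checkpoint, data file, and hyperparameters are illustrative assumptions.
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2).to(device)

df = pd.read_csv("task_a_train.csv")  # hypothetical file with "text" and "label" columns
enc = tokenizer(list(df["text"]), truncation=True, padding=True,
                max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(df["label"].values))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a small number of epochs is typical for fine-tuning
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()  # cross-entropy loss from the classification head
        optimizer.step()
```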
4.2 Task A
In Task A, we have a binary classification scenario, distinguishing between sexist and non-sexist comments. The model demonstrates strong performance, accurately identifying both categories with high precision. It shows a slight edge in recognizing non-sexist content, but overall the classification is balanced and effective for this binary task, as seen in Fig.
2.
The classification report in Table 3 summarizes the performance of the binary classifier. The model’s precision, recall, and F1-score for class “0” (the majority class) are quite high, with values above 95%, indicating strong predictive power for this class. For class “1” (the minority class), the precision is 90.8%, meaning most predictions are correct, though the recall is lower at 84.3%, suggesting that some instances of this class are missed. The F1-score for class “1” is 87.4%, balancing precision and recall. The overall accuracy of the model is 94.2%, with the macro and weighted averages reflecting strong but slightly imbalanced performance across the two classes.
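As a quick consistency check, the reported F1-score for class “1” follows from the stated precision and recall via the usual harmonic mean: \[ F_1 = \frac{2 \times 0.908 \times 0.843}{0.908 + 0.843} \approx 0.874, \] i.e., roughly 87.4%.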
4.3 Task B
Task B presents a more complex scenario with four distinct classes, as seen in Fig.
3. Here, the model’s performance varies more noticeably. It handles two of the classes quite well, particularly class 0 and class 1, but shows decreased accuracy for the other two. This suggests that some classes are more challenging for the model to distinguish, possibly due to overlapping features or subtle differences between categories.
The classification report in Table 4 for the multi-class classification model shows varying performance across the four classes. Class “1” has the best performance, with a precision of 64.8%, a recall of 72.8%, and an F1-score of 68.6%, indicating that the model does well in identifying this class. The other classes perform worse, with Class “0” having a precision of 63.4% but a lower recall of 56.5%, reflecting that many instances of this class are missed. Class “2” and Class “3” also show lower precision, recall, and F1-scores, indicating difficulties in correctly predicting these classes. The overall accuracy is 61%, with the macro average (averaged equally across classes) showing balanced but moderate performance. The weighted average is slightly higher, reflecting the model’s better performance on the more frequent Class “1”. Overall, the model shows moderate success, with room for improvement in identifying certain classes.
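For reference, the macro average treats every class equally, whereas the weighted average scales each class by its share of the test set: \[ \text{macro-}F_1 = \frac{1}{C}\sum_{c=1}^{C} F_{1,c}, \qquad \text{weighted-}F_1 = \sum_{c=1}^{C} \frac{n_c}{N}\, F_{1,c}, \] where \(C\) is the number of classes, \(n_c\) is the number of test instances of class \(c\), and \(N\) is the total number of test instances.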
4.4 Task C
The most intricate classification task is represented in Task C, where the model grapples with eleven different classes. The performance here is mixed, with the model excelling at identifying certain classes while struggling with others. Some classes, like 0, 1, 7, 9, and 10, are recognized with high accuracy, indicating clear distinguishing features for these categories. However, the model faces challenges with other classes, particularly 4, 5, and 8, where accuracy drops significantly. This pattern suggests that some classes have more distinct characteristics, making them easier for the model to identify, while others may share similarities that lead to misclassifications.
The fine-grained vector classification report in Table 5 shows varying performance across the 11 classes. Class “1” achieves the best balance of precision, recall, and F1-score, indicating it is relatively well predicted, though there is still room for improvement. Classes “2,” “3,” and “5” perform moderately, with fairly balanced precision and recall, though prediction errors are evident. Some classes, such as “8” and “4,” show significantly poor performance, with class “8” being completely misclassified. The overall accuracy is moderate, and the macro average, which treats each class equally, reveals underperformance in several classes. The weighted average reflects slightly better performance, as it takes the class distribution into account, but overall the model struggles with certain less frequent classes.
4.5 Gender Bias Analysis
We prepared a sample dataset of 20 sexist and 69 non-sexist sentences. In the sexist sentences, only the word “women” was replaced with “men”, for example, “women aren’t good drivers” versus “men aren’t good drivers”. The 69 non-sexist sentences created a class imbalance that made prediction more difficult. Since the RoBERTa model performed best at binary classification, we put it to the test. All 20 sentences targeting women were classified as sexist, and even some non-sexist sentences were also flagged as sexist. An F1-score of 85% reflects how well the model predicted sexist comments directed at women. When classifying the same sentences with “men” as the subject, the model tagged only one sentence as sexist. A poor F1-score of only 8% shows that the model struggles to detect sexist comments directed at men. This dataset can be accessed
here. The results of using LIME to explain the predictions of the RoBERTa model on potentially sexist comments are visualized in Fig.
5a and
5b.
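A minimal sketch of this gender-swap probe is given below; the file name, column names, fine-tuned checkpoint path, and label mapping are hypothetical assumptions rather than the exact artifacts used in this study.

```python
# Minimal sketch of the gender-swap probe: score the same probe sentences with
# "women" and with "men" as the subject. Paths, column names, and the
# LABEL_0/LABEL_1 mapping are hypothetical assumptions.
import pandas as pd
from sklearn.metrics import f1_score
from transformers import pipeline

clf = pipeline("text-classification", model="./roberta-task-a", truncation=True)

df = pd.read_csv("bias_probe.csv")  # hypothetical file with "text" and "label" columns
swapped = df["text"].str.replace(r"\bwomen\b", "men", regex=True)

def predict(texts):
    # Map pipeline outputs such as "LABEL_1" to integer class ids (1 = sexist).
    return [int(r["label"].split("_")[-1]) for r in clf(list(texts))]

print("F1 (women as subject):", f1_score(df["label"], predict(df["text"])))
print("F1 (men as subject):  ", f1_score(df["label"], predict(swapped)))
```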
In Fig.
5a, the model classified the sentence as “sexist” with a high probability of 0.98 (98%) and as non-sexist with a probability of only 0.02 (2%). The word “women” is highlighted in orange with the highest contribution (0.40), followed by “are” and “douchbags”, which contribute slightly less. The word “always” is highlighted in blue, contributing to the non-sexist prediction but only minimally. The word “women” is a strong indicator of sexism for the model, suggesting that when the subject of a comment is female, the model is more likely to detect sexism, revealing a potential gender bias. In Fig.
5b, the model classified this sentence as non-sexist with a probability of 1.00 (100%) and sexist with a probability of 0.00 (0%). The words “always” and “douchbags” are highlighted in orange, showing that these words had some importance in influencing the model’s decision. Despite this, the model still predicted the comment as non-sexist. Even though the sentence is derogatory towards men, the model does not flag it as sexist, suggesting potential bias where comments aimed at men might be downplayed.
The comparison between the two scenarios shows a clear imbalance in how the RoBERTa model interprets potentially sexist language based on the gender being targeted. Comments directed toward women are flagged as sexist, while similar comments about men are not, indicating gender bias in the model’s predictions. This analysis, using LIME, supports the hypothesis that the model is more sensitive to sexism involving women than men, highlighting the need for further refinement to address this bias.
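For completeness, explanations like those in Fig. 5 can be generated with the LIME text explainer roughly as follows; the local checkpoint path is an assumption and the example comment merely paraphrases Fig. 5a.

```python
# Minimal sketch of producing a LIME explanation for a single comment. The
# checkpoint path is an assumption; the example text paraphrases Fig. 5a.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./roberta-task-a")
model = AutoModelForSequenceClassification.from_pretrained("./roberta-task-a")
model.eval()

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities.
    enc = tokenizer(list(texts), truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["not sexist", "sexist"])
exp = explainer.explain_instance("women are always douchbags",
                                 predict_proba, num_features=6)
print(exp.as_list())  # per-word weights; positive values push towards "sexist"
```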