Abstract
With the rise of social networks, there has been a marked increase in offensive content targeting women, ranging from overt acts of hatred to subtler, often overlooked forms of sexism. The EXIST (sEXism Identification in Social neTworks) competition, initiated in 2021, aimed to advance research in automatically identifying these forms of online sexism. However, the results revealed the multifaceted nature of sexism and emphasized the need for robust systems to detect and classify such content. In this study, we provide an extensive analysis of sexism, highlighting the characteristics and diverse manifestations of sexism across multiple languages on social networks. To achieve this objective, we conducted a detailed analysis of the EXIST dataset to evaluate its capacity to represent various types of sexism. Moreover, we analyzed the systems submitted to the EXIST competition to identify the most effective methodologies and resources for the automated detection of sexism. We employed statistical methods to discern textual patterns related to different categories of sexism, such as stereotyping, misogyny, and sexual violence. Additionally, we investigated linguistic variations in categories of sexism across different languages and platforms. Our results suggest that the EXIST dataset covers a broad spectrum of sexist expressions, from the explicit to the subtle. We observe significant differences in the portrayal of sexism across languages; English texts predominantly feature sexual connotations, whereas Spanish texts tend to reflect neosexism. Across both languages, objectification and misogyny prove to be the most challenging to detect, which is attributable to the varied vocabulary associated with these forms of sexism. Additionally, we demonstrate that models trained on platforms like Twitter can effectively identify sexist content on less-regulated platforms such as Gab. 
Building on these insights, we introduce a transformer-based system with data augmentation techniques that outperforms competition benchmarks. Our work contributes to the field by enhancing the understanding of online sexism and advancing the technological capabilities for its detection.


Data Availability
Data will be made available on request.
References
Campuzano MV (2019) Force and inertia: a systematic review of women’s leadership in male-dominated organizational cultures in the United States. Hum Resour Dev Rev 18(4):437–469
Mandel H, Semyonov M (2014) Gender pay gap and employment sector: sources of earnings disparities in the United States, 1970–2010. Demography 51(5):1597–1618. Accessed Apr 08 2022
Dosil M, Jaureguizar J, Bernaras E, Sbicigo J (2020) Teen dating violence, sexism, and resilience: a multivariate analysis. Int J Environ Res Public Health 17(8):2652
Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L, Gonzalo J, Rosso P, Comet M, Donoso T (2021) Overview of exist 2021: sexism identification in social networks. Proces Leng Nat 67:195–207
Fersini E, Rosso P, Anzovino M (2018) Overview of the task on automatic misogyny identification at ibereval 2018. IberEval@SEPLN 2150:214–228
Pamungkas EW, Basile V, Patti V (2020) Misogyny detection in twitter: a multilingual and cross-domain study. Inf Process Manage 57(6):102360
Guest E, Vidgen B, Mittos A, Sastry N, Tyson G, Margetts H (2021) An expert annotated dataset for the detection of online misogyny. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1336–1350
Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L, Mendieta-Aragón A, Marco-Remón G, Makeienko M, Plaza M, Gonzalo J, Spina D, Rosso P (2022) Overview of exist 2022: sexism identification in social networks. Proces Leng Nat 69:229–240
Waseem Z (2016) Are you a racist or am I seeing things? annotator influence on hate speech detection on Twitter. In: Proceedings of the first workshop on NLP and computational social science. Association for Computational Linguistics, Austin, Texas, pp 138–142. https://doi.org/10.18653/v1/W16-5618. https://aclanthology.org/W16-5618
Waseem Z, Hovy D (2016) Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, San Diego, California, pp 88–93. https://doi.org/10.18653/v1/N16-2013. https://www.aclweb.org/anthology/N16-2013
Frenda S, Ghanem B, Montes-y-Gómez M, Rosso P (2019) Online hate speech against women: automatic identification of misogyny and sexism on twitter. J Intell Fuzzy Syst 36(5):4743–4752
Anzovino M, Fersini E, Rosso P (2018) Automatic identification and classification of misogynistic language on twitter. In: Natural language processing and information systems
Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L (2020) Automatic classification of sexism in social networks: an empirical study on twitter data. IEEE Access 8:219563–219576. https://doi.org/10.1109/ACCESS.2020.3042604
Zeinert P, Inie N, Derczynski L (2021) Annotating online misogyny. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (vol 1: Long Papers), pp 3181–3197
Basile V, Bosco C, Fersini E, Nozza D, Patti V, Pardo FMR, Rosso P, Sanguinetti M et al (2019) Semeval-2019 task 5: multilingual detection of hate speech against immigrants and women in twitter. In: 13th International workshop on semantic evaluation. Association for Computational Linguistics, pp 54–63
Jiang A, Yang X, Liu Y, Zubiaga A (2021) SWSR: a chinese dataset and lexicon for online sexism detection. CoRR abs/2108.03070
Höfels DC, Çöltekin Ç, Mădroane ID (2022) Coroseof-an annotated corpus of romanian sexist and offensive tweets. In: Proceedings of the thirteenth language resources and evaluation conference, pp 2269–2281
Canós JS (2018) Misogyny identification through svm at ibereval 2018. In: IberEval@SEPLN
Nina-Alcocer V (2018) Ami at ibereval2018 automatic misogyny identification in spanish and english tweets. In: IberEval@SEPLN
Frenda S, Ghanem B, Montes M (2018) Exploration of misogyny in spanish and english tweets. In: IberEval@SEPLN
Paetzold GH, Zampieri M, Malmasi S (2019) UTFPR at SemEval-2019 task 5: hate speech identification with recurrent neural networks. In: Proceedings of the 13th international workshop on semantic evaluation. Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp 519–523. https://doi.org/10.18653/v1/S19-2093. https://aclanthology.org/S19-2093
West C (1995) Critical race theory: the key writings that formed the movement. The New Press
Rostami M, Oussalah M, Farrahi V (2022) A novel time-aware food recommender-system based on deep learning and graph clustering. IEEE Access 10:52508–52524
Rostami M, Muhammad U, Forouzandeh S, Berahmand K, Farrahi V, Oussalah M (2022) An effective explainable food recommendation using deep image clustering and community detection. Intell Syst Appl 16:200157
Bassignana E, Basile V, Patti V (2018) Hurtlex: a multilingual lexicon of words to hurt. In: 5th Italian conference on computational linguistics, CLiC-it 2018, vol 2253, pp 1–6. CEUR-WS
Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the 11th international AAAI conference on web and social media. ICWSM ’17, pp 512–515
Hartvigsen T, Gabriel S, Palangi H, Sap M, Ray D, Kamar E (2022) ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. In: Muresan S, Nakov P, Villavicencio A (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (vol 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, pp 3309–3326. https://doi.org/10.18653/v1/2022.acl-long.234. https://aclanthology.org/2022.acl-long.234
Achananuparp P, Hu X, Shen X (2008) The evaluation of sentence similarity measures. In: International conference on data warehousing and knowledge discovery. Springer, pp 305–316
Zannettou S, Bradlyn B, De Cristofaro E, Kwak H, Sirivianos M, Stringhini G, Blackburn J (2018) What is gab: a bastion of free speech or an alt-right echo chamber. In: Companion proceedings of the web conference 2018, pp 1007–1014
Wilson J (2016) Gab: alt-right’s social media alternative attracts users banned from Twitter. The Guardian. https://www.theguardian.com/media/2016/nov/17/gab-alt-right-social-media-twitter
Montes MEA (2021) Proceedings of the iberian languages evaluation forum (iberlef 2021). In: CEUR Workshop proceedings
Paula A, Silva R, Schlicht I (2021) Sexism prediction in spanish and english tweets using monolingual and multilingual bert and ensemble models. Proces Leng Nat
Canete J, Chaperon G, Fuentes R, Pérez J (2020) Spanish pre-trained bert model and evaluation data. PML4DC at ICLR 2020
Martínez-Cámara E, Díaz-Galiano M, García-Cumbreras M, García-Vega M, Villena-Román J (2017) Overview of tass 2017. IberEval@SEPLN 1896:13–21
Mnassri K, Rajapaksha P, Farahbakhsh R, Crespi N (2022) BERT-based ensemble approaches for hate speech detection. https://doi.org/10.48550/ARXIV.2209.06505. https://arxiv.org/abs/2209.06505
He P, Liu X, Gao J, Chen W (2020) Deberta: decoding-enhanced bert with disentangled attention. arXiv:2006.03654
Fandiño AG, Estapé JA, Pàmies M, Palao JL, Ocampo JS, Carrino CP, Oller CA, Penagos CR, Agirre AG, Villegas M (2022) Maria: Spanish language models. Proces Leng Nat 68. https://doi.org/10.26342/2022-68-3
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv:1907.11692
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference track proceedings. http://arxiv.org/abs/1412.6980
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J (2019) Huggingface’s transformers: State-of-the-art natural language processing. CoRR abs/1910.03771
Vaca-Serrano A (2022) Detecting and classifying sexism by ensembling transformers models. Proces Leng Nat 2:1
Nguyen DQ, Vu T, Nguyen AT (2020) Bertweet: a pre-trained language model for english tweets. arXiv:2005.10200
Rosa J, Ponferrada EG, Romero M, Villegas P, Prado Salas PG, Grandury M (2022) Bertin: efficient pre-training of a spanish language model using perplexity sampling. Proces Leng Nat 68:13–23
Pérez JM, Furman DA, Alemany LA, Luque F (2021) Robertuito: a pre-trained language model for social media text in spanish. arXiv:2111.09453
Plaza L, Carrillo-de-Albornoz J, Morante R, Amigó E, Gonzalo J, Spina D, Rosso P (2023) Overview of exist 2023–learning with disagreement for sexism identification and characterization (extended overview). Working Notes of CLEF
Acknowledgements
This work was supported by the Spanish Ministry of Science and Innovation under the project “FairTransNLP: Midiendo y Cuantificando el sesgo y la justicia en sistemas de PLN” (PID2021-124361OB-C32), funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe.
Author information
Contributions
All authors contributed equally to this work.
Ethics declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and Informed Consent for Data Used
The data utilized in this study was developed by the authors specifically for research purposes within the context of the EXIST competition [4]. Given its original creation and management by the authors, there are no concerns related to external data collection or participant consent. All necessary ethical considerations, including ensuring the anonymity and confidentiality of all participants or contributors, were strictly adhered to during data collection and processing.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J. & Plaza, L. Detecting sexism in social media: an empirical analysis of linguistic patterns and strategies. Appl Intell 54, 10995–11019 (2024). https://doi.org/10.1007/s10489-024-05795-2
DOI: https://doi.org/10.1007/s10489-024-05795-2