Detecting sexism in social media: an empirical analysis of linguistic patterns and strategies

Applied Intelligence

Abstract

With the rise of social networks, there has been a marked increase in offensive content targeting women, ranging from overt acts of hatred to subtler, often overlooked forms of sexism. The EXIST (sEXism Identification in Social neTworks) competition, initiated in 2021, aimed to advance research in automatically identifying these forms of online sexism. However, the results revealed the multifaceted nature of sexism and emphasized the need for robust systems to detect and classify such content. In this study, we provide an extensive analysis of sexism, highlighting the characteristics and diverse manifestations of sexism across multiple languages on social networks. To achieve this objective, we conducted a detailed analysis of the EXIST dataset to evaluate its capacity to represent various types of sexism. Moreover, we analyzed the systems submitted to the EXIST competition to identify the most effective methodologies and resources for the automated detection of sexism. We employed statistical methods to discern textual patterns related to different categories of sexism, such as stereotyping, misogyny, and sexual violence. Additionally, we investigated linguistic variations in categories of sexism across different languages and platforms. Our results suggest that the EXIST dataset covers a broad spectrum of sexist expressions, from the explicit to the subtle. We observe significant differences in the portrayal of sexism across languages; English texts predominantly feature sexual connotations, whereas Spanish texts tend to reflect neosexism. Across both languages, objectification and misogyny prove to be the most challenging to detect, which is attributable to the varied vocabulary associated with these forms of sexism. Additionally, we demonstrate that models trained on platforms like Twitter can effectively identify sexist content on less-regulated platforms such as Gab. 
Building on these insights, we introduce a transformer-based system with data augmentation techniques that outperforms competition benchmarks. Our work contributes to the field by enhancing the understanding of online sexism and advancing the technological capabilities for its detection.

Data Availability

Data will be made available on request.

Notes

  1. https://www.merriam-webster.com/dictionary/feminism

  2. https://cloud.google.com/translate/docs/reference/rest/

References

  1. Campuzano MV (2019) Force and inertia: a systematic review of women’s leadership in male-dominated organizational cultures in the United States. Hum Resour Dev Rev 18(4):437–469

  2. Mandel H, Semyonov M (2014) Gender pay gap and employment sector: sources of earnings disparities in the United States, 1970–2010. Demography 51(5):1597–1618

  3. Dosil M, Jaureguizar J, Bernaras E, Sbicigo J (2020) Teen dating violence, sexism, and resilience: a multivariate analysis. Int J Environ Res Public Health 17(8):2652

  4. Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L, Gonzalo J, Rosso P, Comet M, Donoso T (2021) Overview of exist 2021: sexism identification in social networks. Proces Leng Nat 67:195–207

  5. Fersini E, Rosso P, Anzovino M (2018) Overview of the task on automatic misogyny identification at ibereval 2018. IberEval@ SEPLN 2150, 214–228

  6. Pamungkas EW, Basile V, Patti V (2020) Misogyny detection in twitter: a multilingual and cross-domain study. Inf Process Manage 57(6):102360

  7. Guest E, Vidgen B, Mittos A, Sastry N, Tyson G, Margetts H (2021) An expert annotated dataset for the detection of online misogyny. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1336–1350

  8. Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L, Mendieta-Aragón A, Marco-Remón G, Makeienko M, Plaza M, Gonzalo J, Spina D, Rosso P (2022) Overview of exist 2022: sexism identification in social networks. Proces Leng Nat 69:229–240

  9. Waseem Z (2016) Are you a racist or am I seeing things? annotator influence on hate speech detection on Twitter. In: Proceedings of the first workshop on NLP and computational social science. Association for Computational Linguistics, Austin, Texas, pp 138–142. https://doi.org/10.18653/v1/W16-5618. https://aclanthology.org/W16-5618

  10. Waseem Z, Hovy D (2016) Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, San Diego, California, pp 88–93. https://doi.org/10.18653/v1/N16-2013. https://www.aclweb.org/anthology/N16-2013

  11. Frenda S, Ghanem B, Montes-y-Gómez M, Rosso P (2019) Online hate speech against women: automatic identification of misogyny and sexism on twitter. J Intell Fuzzy Syst 36(5):4743–4752

  12. Anzovino M, Fersini E, Rosso P (2018) Automatic identification and classification of misogynistic language on twitter. In: Natural language processing and information systems

  13. Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L (2020) Automatic classification of sexism in social networks: an empirical study on twitter data. IEEE Access 8:219563–219576. https://doi.org/10.1109/ACCESS.2020.3042604

  14. Zeinert P, Inie N, Derczynski L (2021) Annotating online misogyny. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (vol 1: Long Papers), pp 3181–3197

  15. Basile V, Bosco C, Fersini E, Debora N, Patti V, Pardo FMR, Rosso P, Sanguinetti M et al (2019) Semeval-2019 task 5: multilingual detection of hate speech against immigrants and women in twitter. In: 13th International workshop on semantic evaluation. Association for Computational Linguistics, pp 54–63

  16. Jiang A, Yang X, Liu Y, Zubiaga A (2021) SWSR: a chinese dataset and lexicon for online sexism detection. CoRR abs/2108.03070

  17. Höfels DC, Çöltekin Ç, Mădroane ID (2022) CoRoSeOf - an annotated corpus of romanian sexist and offensive tweets. In: Proceedings of the thirteenth language resources and evaluation conference, pp 2269–2281

  18. Canós JS (2018) Misogyny identification through svm at ibereval 2018. In: IberEval@SEPLN

  19. Nina-Alcocer V (2018) Ami at ibereval2018 automatic misogyny identification in spanish and english tweets. In: IberEval@SEPLN

  20. Frenda S, Ghanem B, Montes M (2018) Exploration of misogyny in spanish and english tweets. In: IberEval@SEPLN

  21. Paetzold GH, Zampieri M, Malmasi S (2019) UTFPR at SemEval-2019 task 5: hate speech identification with recurrent neural networks. In: Proceedings of the 13th international workshop on semantic evaluation. Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp 519–523. https://doi.org/10.18653/v1/S19-2093. https://aclanthology.org/S19-2093

  22. West C (1995) Critical race theory: the key writings that formed the movement. The New Press

  23. Rostami M, Oussalah M, Farrahi V (2022) A novel time-aware food recommender-system based on deep learning and graph clustering. IEEE Access 10:52508–52524

  24. Rostami M, Muhammad U, Forouzandeh S, Berahmand K, Farrahi V, Oussalah M (2022) An effective explainable food recommendation using deep image clustering and community detection. Intell Syst Appl 16:200157

  25. Bassignana E, Basile V, Patti V (2018) Hurtlex: a multilingual lexicon of words to hurt. In: 5th Italian conference on computational linguistics, CLiC-it 2018, vol 2253, pp 1–6. CEUR-WS

  26. Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the 11th international AAAI conference on web and social media. ICWSM ’17, pp 512–515

  27. Hartvigsen T, Gabriel S, Palangi H, Sap M, Ray D, Kamar E (2022) ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. In: Muresan S, Nakov P, Villavicencio A (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (vol 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, pp 3309–3326. https://doi.org/10.18653/v1/2022.acl-long.234. https://aclanthology.org/2022.acl-long.234

  28. Achananuparp P, Hu X, Shen X (2008) The evaluation of sentence similarity measures. In: International conference on data warehousing and knowledge discovery. Springer, pp 305–316

  29. Zannettou S, Bradlyn B, De Cristofaro E, Kwak H, Sirivianos M, Stringini G, Blackburn J (2018) What is gab: a bastion of free speech or an alt-right echo chamber. In: Companion proceedings of the web conference 2018, pp 1007–1014

  30. Wilson J (2016) Gab: alt-right’s social media alternative attracts users banned from Twitter. The Guardian. https://www.theguardian.com/media/2016/nov/17/gab-alt-right-social-media-twitter

  31. Montes MEA (2021) Proceedings of the iberian languages evaluation forum (iberlef 2021). In: CEUR Workshop proceedings

  32. Paula A, Silva R, Schlicht I (2021) Sexism prediction in spanish and english tweets using monolingual and multilingual bert and ensemble models. Proces Leng Nat

  33. Canete J, Chaperon G, Fuentes R, Pérez J (2020) Spanish pre-trained bert model and evaluation data. PML4DC at ICLR 2020

  34. Martínez-Cámara E, Díaz-Galiano M, García-Cumbreras M, García-Vega M, Villena-Román J (2017) Overview of tass 2017. IberEval@ SEPLN 1896, 13–21

  35. Mnassri K, Rajapaksha P, Farahbakhsh R, Crespi N (2022) BERT-based ensemble approaches for hate speech detection. https://doi.org/10.48550/ARXIV.2209.06505. https://arxiv.org/abs/2209.06505

  36. He P, Liu X, Gao J, Chen W (2020) Deberta: decoding-enhanced bert with disentangled attention. arXiv:2006.03654

  37. Fandiño AG, Estapé JA, Pàmies M, Palao JL, Ocampo JS, Carrino CP, Oller CA, Penagos CR, Agirre AG, Villegas M (2022) Maria: Spanish language models. Proces Leng Nat 68. https://doi.org/10.26342/2022-68-3

  38. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv:1907.11692

  39. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference track proceedings. http://arxiv.org/abs/1412.6980

  40. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch

  41. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J (2019) Huggingface’s transformers: State-of-the-art natural language processing. CoRR abs/1910.03771

  42. Vaca-Serrano A (2022) Detecting and classifying sexism by ensembling transformers models. Proces Leng Nat 2:1

  43. Nguyen DQ, Vu T, Nguyen AT (2020) Bertweet: a pre-trained language model for english tweets. arXiv:2005.10200

  44. Rosa J, Ponferrada EG, Romero M, Villegas P, Prado Salas PG, Grandury M (2022) Bertin: efficient pre-training of a spanish language model using perplexity sampling. Proces Leng Nat 68:13–23

  45. Pérez JM, Furman DA, Alemany LA, Luque F (2021) Robertuito: a pre-trained language model for social media text in spanish. arXiv:2111.09453

  46. Plaza L, Carrillo-de-Albornoz J, Morante R, Amigó E, Gonzalo J, Spina D, Rosso P (2023) Overview of exist 2023–learning with disagreement for sexism identification and characterization (extended overview). Working Notes of CLEF

Acknowledgements

This work was supported by the Spanish Ministry of Science and Innovation under the project “FairTransNLP: Midiendo y Cuantificando el sesgo y la justicia en sistemas de PLN” (PID2021-124361OB-C32), funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU “A way of making Europe”.

Author information

Contributions

All authors contributed equally to this work.

Corresponding authors

Correspondence to Francisco Rodríguez-Sánchez, Jorge Carrillo-de-Albornoz or Laura Plaza.

Ethics declarations

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and Informed Consent for Data Used

The data utilized in this study was developed by the authors specifically for research purposes within the context of the EXIST competition [4]. Given its original creation and management by the authors, there are no concerns related to external data collection or participant consent. All necessary ethical considerations, including ensuring the anonymity and confidentiality of all participants or contributors, were strictly adhered to during data collection and processing.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J. & Plaza, L. Detecting sexism in social media: an empirical analysis of linguistic patterns and strategies. Appl Intell 54, 10995–11019 (2024). https://doi.org/10.1007/s10489-024-05795-2

  • DOI: https://doi.org/10.1007/s10489-024-05795-2
