
Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

We present the Gab Hate Corpus (GHC), consisting of 27,665 posts from the social network service gab.com, each annotated for the presence of “hate-based rhetoric” by a minimum of three annotators. Posts were labeled according to a coding typology derived from a synthesis of hate speech definitions across legal precedent, previous hate speech coding typologies, and definitions from psychology and sociology, comprising hierarchical labels indicating dehumanizing and violent speech as well as indicators of targeted groups and rhetorical framing. We provide inter-annotator agreement statistics and perform a classification analysis in order to validate the corpus and establish performance baselines. The GHC complements existing hate speech datasets in its theoretical grounding and by providing a large, representative sample of richly annotated social media posts.
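Each post in the GHC carries labels from at least three annotators, and the paper reports inter-annotator agreement statistics to validate the corpus. As an illustrative sketch only (not the authors' code), aggregating binary labels by majority vote and computing Fleiss' kappa over the annotation matrix can be done as follows:

```python
from collections import Counter

def majority_label(row):
    """Majority vote over one item's annotator labels (illustrative helper)."""
    return Counter(row).most_common(1)[0][0]

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, where ratings[i] lists the labels
    that a fixed number of annotators assigned to item i."""
    n = len(ratings)            # number of items
    r = len(ratings[0])         # annotators per item
    cats = sorted({c for row in ratings for c in row})
    counts = [Counter(row) for row in ratings]

    # Mean per-item agreement (P-bar)
    p_bar = sum(
        (sum(v * v for v in cnt.values()) - r) / (r * (r - 1))
        for cnt in counts
    ) / n

    # Chance agreement (P_e) from the marginal category proportions
    p_j = [sum(cnt[c] for cnt in counts) / (n * r) for c in cats]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: four posts, three annotators each, binary hate labels
labels = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print([majority_label(row) for row in labels])   # [1, 0, 1, 0]
print(round(fleiss_kappa(labels), 3))            # 0.333
```

Fleiss' kappa corrects raw agreement for chance, which matters for sparse labels like hate speech, where raw percent agreement is inflated by the dominant negative class.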


Notes

  1. We will use “hate speech” to refer to the broad set of tasks in this category, including abusive and toxic language.

  2. Available as version 1 of the preprint at https://psyarxiv.com/hqjxn/.

  3. https://hatebase.org/.

  4. https://files.pushshift.io/gab/.

  5. https://osf.io/edua3/.

  6. https://github.com/huggingface/transformers.

  7. See https://github.com/ufoym/imbalanced-dataset-sampler.


Funding

This research was sponsored by NSF CAREER BCS-1846531 to MD.

Author information


Corresponding author

Correspondence to Morteza Dehghani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (TEX 34 kb)


About this article


Cite this article

Kennedy, B., Atari, M., Davani, A.M. et al. Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale. Lang Resources & Evaluation 56, 79–108 (2022). https://doi.org/10.1007/s10579-021-09569-x
