DOI: 10.1145/3477495.3531945
Research Article | Public Access

Bias Mitigation for Toxicity Detection via Sequential Decisions

Published: 07 July 2022

Abstract

Increased social media use has contributed to the greater prevalence of abusive, rude, and offensive textual comments. Machine learning models have been developed to detect toxic comments online, yet these models tend to show biases against users with marginalized or minority identities (e.g., females and African Americans). Established research in debiasing toxicity classifiers often (1) takes a static or batch approach, assuming that all information is available and then making a one-time decision; and (2) uses a generic strategy to mitigate different biases (e.g., gender and racial biases) that assumes the biases are independent of one another. However, in real scenarios, the input typically arrives as a sequence of comments/words over time instead of all at once. Thus, decisions based on partial information must be made while additional input is arriving. Moreover, social bias is complex by nature. Each type of bias is defined within its unique context, which, consistent with intersectionality theory within the social sciences, might be correlated with the contexts of other forms of bias. In this work, we consider debiasing toxicity detection as a sequential decision-making process where different biases can be interdependent. In particular, we study debiasing toxicity detection with two aims: (1) to examine whether different biases tend to correlate with each other; and (2) to investigate how to jointly mitigate these correlated biases in an interactive manner to minimize the total amount of bias. At the core of our approach is a framework built upon theories of sequential Markov Decision Processes that seeks to maximize the prediction accuracy and minimize the bias measures tailored to individual biases. Evaluations on two benchmark datasets empirically validate the hypothesis that biases tend to be correlated and corroborate the effectiveness of the proposed sequential debiasing strategy.
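
The framework described in the abstract reduces to two ideas: decisions are made on partial input as words arrive, and the reward trades prediction accuracy off against penalties for each (possibly correlated) bias. The Python sketch below illustrates that framing only, under stated assumptions; it is not the authors' method. Every name (toxicity_score, bias_penalty, sequential_decision), the toxic-word and identity-term lists, and the weights in lambdas are hypothetical stand-ins, and a fixed confidence threshold substitutes for the MDP policy the paper learns.

```python
# Illustrative sketch of sequential, bias-penalized toxicity decisions.
# All word lists, weights, and function names are hypothetical stand-ins.

BIAS_DIMENSIONS = ("gender", "race")  # the two bias types studied in the paper

def toxicity_score(tokens):
    """Stand-in scorer; a real system would use a trained classifier."""
    toxic_words = {"idiot", "stupid", "hate"}
    return sum(t in toxic_words for t in tokens) / max(len(tokens), 1)

def bias_penalty(tokens, dimension):
    """Stand-in penalty: fraction of identity terms whose mere presence
    should not drive a toxic prediction (the unintended bias)."""
    identity_terms = {
        "gender": {"she", "her", "woman", "female"},
        "race": {"black", "african"},
    }
    return sum(t in identity_terms[dimension] for t in tokens) / max(len(tokens), 1)

def reward(tokens, action, label, lambdas):
    """Accuracy term minus weighted penalties for each bias dimension.
    Correlated biases are mitigated jointly because they share one reward."""
    accuracy = 1.0 if action == label else -1.0
    penalty = sum(lambdas[d] * bias_penalty(tokens, d) for d in BIAS_DIMENSIONS)
    return accuracy - penalty

def sequential_decision(comment, label, threshold=0.25, lambdas=None):
    """Read a comment word by word and classify once evidence suffices,
    rather than waiting for the full text as a batch model would."""
    lambdas = lambdas or {d: 0.5 for d in BIAS_DIMENSIONS}
    seen = []
    for word in comment.lower().split():
        seen.append(word)
        if toxicity_score(seen) >= threshold:  # act on partial input
            return "toxic", reward(seen, "toxic", label, lambdas)
    return "non_toxic", reward(seen, "non_toxic", label, lambdas)

if __name__ == "__main__":
    action, r = sequential_decision("She is an idiot", label="toxic")
    print(action, round(r, 2))  # toxic 0.88: correct prediction, minus gender penalty
```

In the full approach, the threshold heuristic would be replaced by a policy trained to maximize this kind of reward over comment sequences, which is what allows correlated bias penalties to be mitigated jointly rather than independently.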

Supplementary Material

MP4 File (SIGIR22-fp0979.mp4)
This video is a brief presentation of the paper. Please refer to the paper for more details.


Cited By

  • (2024) A Comprehensive Approach to Bias Mitigation for Sentiment Analysis of Social Media Data. Applied Sciences 14(23), 11471. DOI: 10.3390/app142311471. Online publication date: 9-Dec-2024.
  • (2024) SMART-TBI: Design and Evaluation of the Social Media Accessibility and Rehabilitation Toolkit for Users with Traumatic Brain Injury. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-19. DOI: 10.1145/3663548.3675641. Online publication date: 27-Oct-2024.
  • (2023) Anatomy of Hate Speech Datasets. Proceedings of the 34th ACM Conference on Hypertext and Social Media, 1-11. DOI: 10.1145/3603163.3609158. Online publication date: 4-Sep-2023.


Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN:9781450387323
DOI:10.1145/3477495


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. sequential decision-making
  2. social media
  3. toxicity detection
  4. unintended bias


Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 195
  • Downloads (Last 6 weeks): 26
Reflects downloads up to 28 Feb 2025

