
CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection

  • Conference paper
  • First Online:
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)


Abstract

Patronizing and Condescending Language (PCL) is a form of implicitly toxic speech aimed at vulnerable groups, with the potential to cause them long-term harm. As an emerging area of toxicity detection, it still lacks high-quality annotated corpora, especially in Chinese. Existing PCL datasets also lack fine-grained annotation of toxicity level, which discards information about borderline cases. In this paper, we make the first attempt at fine-grained condescension detection in Chinese. First, we propose the CondescendCN Frame, a hierarchical framework for fine-grained condescension detection. On this basis, we introduce CCPC, a hierarchical Chinese corpus for PCL with 11k structured annotations of social media comments from Sina Weibo and Zhihu. We find that adding toxicity strength (TS) effectively improves PCL detection, and we show that the trained model retains decent detection capability when transferred to a larger and more varied set of media data (over 120k comments). Due to the subjective ambiguity of PCL, richer contextual information and broader subject knowledge are critically needed in this field.
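The abstract notes that adding toxicity strength (TS) as an auxiliary signal improves PCL detection. As a rough illustration only, the sketch below shows one way a multi-task classifier could pair a binary PCL head with a TS head on top of the bert-base-chinese checkpoint listed in the notes; the two-head design, the number of TS levels, and the loss weighting are assumptions for illustration, not the authors' released implementation (their code is linked in Note 1).

    # Minimal sketch (assumption, not the authors' code): BERT encoder with two heads,
    # a binary PCL head and an auxiliary toxicity-strength (TS) head.
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class PCLWithToxicityStrength(nn.Module):
        def __init__(self, model_name="bert-base-chinese", num_ts_levels=4):  # TS levels assumed
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size
            self.pcl_head = nn.Linear(hidden, 2)             # PCL vs. non-PCL
            self.ts_head = nn.Linear(hidden, num_ts_levels)  # toxicity strength level

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]                # [CLS] representation
            return self.pcl_head(cls), self.ts_head(cls)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = PCLWithToxicityStrength()
    batch = tokenizer(["示例评论"], return_tensors="pt", padding=True, truncation=True)
    pcl_logits, ts_logits = model(batch["input_ids"], batch["attention_mask"])
    # Training would sum both losses, e.g. L = L_pcl + lambda * L_ts (lambda assumed).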



Notes

  1. Our dataset and code are available at https://github.com/dut-laowang/CCPC.

  2. https://huggingface.co/bert-base-uncased.

  3. https://huggingface.co/bert-base-multilingual-cased.

  4. https://huggingface.co/bert-base-chinese.


Author information

Corresponding author

Correspondence to Hongfei Lin.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, H., Li, M., Lu, J., Yang, L., Xia, H., Lin, H. (2023). CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science (LNAI), vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_50


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_50

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
