
CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection

  • Conference paper
  • First Online:
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)


Abstract

Patronizing and Condescending Language (PCL) is a form of implicitly toxic speech aimed at vulnerable groups, with the potential to cause them long-term harm. As an emerging area of toxicity detection, it still lacks high-quality annotated corpora, especially in Chinese. Existing PCL datasets also lack fine-grained annotation of toxicity level, which discards information about borderline cases. In this paper, we make the first attempt at fine-grained condescension detection in Chinese. First, we propose the CondescendCN Frame, a hierarchical framework for fine-grained condescension detection. On this basis, we introduce CCPC, a hierarchical Chinese corpus for PCL with 11k structured annotations of social media comments from Sina Weibo and Zhihu. We find that adding toxicity strength (TS) effectively improves PCL detection, and we show that the trained model retains decent detection capability when transferred to a larger and more varied set of media data (over 120k comments). Due to the subjective ambiguity of PCL, richer contextual information and broader subject knowledge are critically needed in this field.
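The abstract notes that adding toxicity strength (TS) as an auxiliary signal improves PCL detection. As a rough illustration only, the sketch below shows one way a multi-task classifier could pair a binary PCL head with a TS head on top of the bert-base-chinese checkpoint listed in the notes; the two-head design, the number of TS levels, and the loss weighting are assumptions for illustration, not the authors' released implementation (their code is linked in Note 1).

    # Minimal sketch (assumption, not the authors' code): BERT encoder with two heads,
    # a binary PCL head and an auxiliary toxicity-strength (TS) head.
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class PCLWithToxicityStrength(nn.Module):
        def __init__(self, model_name="bert-base-chinese", num_ts_levels=4):  # TS levels assumed
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size
            self.pcl_head = nn.Linear(hidden, 2)             # PCL vs. non-PCL
            self.ts_head = nn.Linear(hidden, num_ts_levels)  # toxicity strength level

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]                # [CLS] representation
            return self.pcl_head(cls), self.ts_head(cls)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = PCLWithToxicityStrength()
    batch = tokenizer(["示例评论"], return_tensors="pt", padding=True, truncation=True)
    pcl_logits, ts_logits = model(batch["input_ids"], batch["attention_mask"])
    # Training would sum both losses, e.g. L = L_pcl + lambda * L_ts (lambda assumed).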



Notes

  1. Our dataset and code are available at https://github.com/dut-laowang/CCPC.

  2. https://huggingface.co/bert-base-uncased.

  3. https://huggingface.co/bert-base-multilingual-cased.

  4. https://huggingface.co/bert-base-chinese.


Author information

Corresponding author

Correspondence to Hongfei Lin.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, H., Li, M., Lu, J., Yang, L., Xia, H., Lin, H. (2023). CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science (LNAI), vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_50


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_50

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
