
Generalized Cohen’s Kappa: A Novel Inter-rater Reliability Metric for Non-mutually Exclusive Categories

  • Conference paper
Human Interface and the Management of Information (HCII 2023)

Abstract

Qualitative coding of large datasets is a valuable tool for qualitative researchers, but existing inter-rater reliability (IRR) metrics have not evolved to fit current coding practices and impose a variety of restrictions. In this paper, we propose Generalized Cohen's Kappa (GCK), a novel IRR metric that can be applied across a variety of qualitative coding situations, including variable numbers of coders and texts as well as non-mutually exclusive categories. We show that, under the preconditions for Cohen's kappa, GCK performs very similarly, demonstrating that the two are interchangeable in that setting. We then extend GCK to the aforementioned situations and demonstrate that it remains stable under different permutations.
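
For readers unfamiliar with the baseline metric, Cohen's kappa compares the agreement two raters actually achieve against the agreement expected by chance given each rater's label frequencies. The sketch below (plain Python with made-up coder labels) illustrates only that standard two-rater, mutually-exclusive-category computation; the preview does not include the GCK formula, so this should not be read as the authors' method.

# Minimal sketch of standard Cohen's kappa for two raters and mutually
# exclusive categories -- the baseline setting in which the paper reports
# that GCK behaves almost identically.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of category labels."""
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)

    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all categories either rater used.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    if p_e == 1.0:  # both raters assigned a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two coders labeling ten excerpts with exclusive codes.
coder_1 = ["joy", "anger", "joy", "sad", "joy", "anger", "sad", "joy", "joy", "sad"]
coder_2 = ["joy", "anger", "sad", "sad", "joy", "joy", "sad", "joy", "anger", "sad"]
print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.531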



Author information

Correspondence to Sourojit Ghosh.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Figueroa, A., Ghosh, S., Aragon, C. (2023). Generalized Cohen’s Kappa: A Novel Inter-rater Reliability Metric for Non-mutually Exclusive Categories. In: Mori, H., Asahi, Y. (eds) Human Interface and the Management of Information. HCII 2023. Lecture Notes in Computer Science, vol 14015. Springer, Cham. https://doi.org/10.1007/978-3-031-35132-7_2

  • DOI: https://doi.org/10.1007/978-3-031-35132-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35131-0

  • Online ISBN: 978-3-031-35132-7

  • eBook Packages: Computer Science, Computer Science (R0)
