Abstract
Qualitative coding of large datasets is a valuable tool for qualitative researchers, but existing inter-rater reliability (IRR) metrics have not evolved to fit current coding approaches and impose a variety of restrictions. In this paper, we propose Generalized Cohen's kappa (GCK), a novel IRR metric that can be applied in a wide range of qualitative coding situations, including a variable number of coders, multiple texts, and non-mutually exclusive categories. We show that under the preconditions for Cohen's kappa, GCK performs very similarly, demonstrating that the two metrics are interchangeable in that setting. We then extend GCK to the situations above and show that it remains stable under different permutations.
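For context, the classic two-coder, mutually-exclusive-category setting that GCK generalizes is scored with Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each coder's marginal label frequencies. Below is a minimal Python sketch of that standard statistic; the function name and example labels are illustrative, and the GCK formula itself is defined in the full paper, not here.

from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Standard two-coder Cohen's kappa (Cohen, 1960).

    codes_a, codes_b: equal-length lists of category labels, one per item,
    assuming mutually exclusive categories -- the classic precondition
    that Generalized Cohen's kappa relaxes.
    """
    n = len(codes_a)
    # Observed agreement: fraction of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: sum over categories of the product of each coder's
    # marginal frequency for that category.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two coders labeling six text segments.
print(cohens_kappa(["pos", "neg", "pos", "neu", "pos", "neg"],
                   ["pos", "neg", "neu", "neu", "pos", "pos"]))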