Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements

Siegert, Ingo; Böck, Ronald; Wendemuth, Andreas

doi:10.1007/s12193-013-0129-9

Original Paper
Published: 29 November 2013

Volume 8, pages 17–28, (2014)
Journal on Multimodal User Interfaces

Abstract

To enable naturalistic human–computer interaction, the recognition of emotions and intentions is receiving increased attention, and several modalities are combined to cover all human communication abilities. For this reason, naturalistic material is recorded in which the subjects are guided through an interaction with crucial points but are free to react individually. This material captures realistic user reactions but lacks clear labels, so a good transcription and annotation of the material is essential. For this purpose, the use of human annotators has become widely accepted. A good measure of the reliability of labelled material is the inter-rater agreement. In this paper we investigate the inter-rater agreement achieved on emotionally annotated interaction corpora, using Krippendorff’s alpha, and present methods to improve the reliability. We show that the reliabilities obtained with different methods do not differ much, so the choice of method can rely on other aspects. Furthermore, a multimodal presentation of the items in their natural order increases the reliability.
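
As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the authors' implementation: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per item, `None` for a missing rating) and the example labels are assumptions made for this sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a list of items; each item is the list of labels that the
    raters assigned to it, with None marking a missing rating.
    """
    # Coincidence matrix: (label_a, label_b) -> weighted count of rater pairs.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings yields no pairs
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the total number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    # Observed and chance-expected disagreement (nominal metric: 0 or 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Made-up example: two raters who agree on every item.
print(krippendorff_alpha_nominal([["joy", "joy"], ["anger", "anger"]]))  # 1.0
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values indicate systematic disagreement. Ordinal or interval annotation schemes would need a different distance function than the 0/1 metric assumed here.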

References

Altman DG (1991) Practical statistics for medical research. Chapman & Hall, London
Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Comput Linguist 34(4):555–596
Batliner A, Hacker C, Steidl S, Nöth E, Russell M, Wong M (2004) “You stupid tin box”-children interacting with the AIBO robot: a cross-linguistic emotional speech corpus. In: Proceedings of LREC, pp 865–868
Böck R, Siegert I, Vlasenko B, Wendemuth A, Haase M, Lange J (2011) A processing tool for emotionally coloured speech. In: Proceedings of ICME, s.p.
Bradley M, Lang P (1994) Measuring emotion: the self-assessment manikin and the semantic differential. J Behav Ther Exp Psy 25(1):49–59
Burger S, MacLaren V, Yu H (2002) The ISL meeting corpus: the impact of meeting type on speech style. In: Proceedings of the international conference on spoken language processing, pp 301–304
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Proceedings of interspeech, pp 1517–1520
Callejas Z, López-Cózar R (2008) Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Commun 50(5):416–433
Cauldwell RT (2000) Where did the anger go? The role of context in interpreting emotion in speech. In: Proceedings of ITRW on speech and emotion, pp 127–131
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Cowie R, Cornelius RR (2003) Describing the emotional states that are expressed in speech. Speech Commun 40(1–2):5–32
Crawford JR, Henry JD (2004) The positive and negative affect schedule (PANAS): construct validity, measurement properties and normative data in a large non-clinical sample. Br J Clin Psychol 43(3):245–265
Cronbach L (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16(3):297–334
Devillers L, Vasilescu I (2004) Reliability of lexical and prosodic cues in two real-life spoken dialog corpora. In: Proceedings of LREC, pp 865–868
Devillers L, Vidrascu L, Lamel L (2005) Challenges in real-life emotion annotation and machine learning based detection. Neural Netw 18(4):407–422
Douglas-Cowie E, Cowie R, Schröder M (2000) A new emotion database: considerations, sources and scope. In: Proceedings of ITRW on speech and emotion, pp 39–44
Douglas-Cowie E, Cowie R, Sneddon I, Cox C, Lowry O, McRorie M, Martin JC, Devillers L, Abrilian S, Batliner A, Amir N, Karpouzis K (2007) The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. In: Proceedings of ACII. Berlin, Heidelberg, pp 488–500
Douglas-Cowie E, Devillers L, Martin JC, Cowie R, Savvidou S, Abrilian S, Cox C (2005) Multimodal databases of everyday emotion: facing up to complexity. In: Proceedings of EUROSPEECH, pp 813–816
Eggink J, Bland D (2012) A large scale experiment for mood-based classification of TV programmes. In: Proceedings of ICME, pp 140–145
Ekman P (1992) Are there basic emotions? Psychol Rev 99(3):550–553
El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587
Engberg IS, Hansen AV (1996) Documentation of the Danish emotional speech database (DES). Technical report, Center for Person, Kommunikation, Aalborg University, Denmark. Internal AAU report
Feinstein AR, Cicchetti DV (1990) High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43(6):543–549
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382
Fleiss JL, Levin B, Paik MC (1991) Statistical methods for rates & proportions, 3rd edn. Wiley, Hoboken
Fragopanagos N, Taylor J (2005) Emotion recognition in human-computer interaction. Neural Netw 18(4):389–405
Frommer J, Michaelis B, Rösner D, Wendemuth A, Friesen R, Haase M, Kunze M, Andrich R, Lange J, Panning A, Siegert I (2012) Towards emotion and affect detection in the multimodal LAST MINUTE corpus. In: Proceedings of LREC, pp 3064–3069
Frommer J, Rösner D, Haase M, Lange J, Friesen R, Otto M (2012) Detection and avoidance of failures in dialogues-Wizard of Oz Experiment Operator’s Manual. Pabst Science Publishers
Gehm T, Scherer K (1988) Factors determining the dimensions of subjective emotional space. In: Scherer K (ed) Facets of emotion: recent research. Erlbaum, Hillsdale, NJ, pp 99–114
Gnjatović M, Rösner D (2008) The NIMITEK corpus of affected behavior in human-machine interaction. In: Proceedings of LREC, pp 5–8
Grandjean D, Sander D, Scherer K (2008) Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization. Conscious Cogn 17(2):484–495
Grimm M, Kroschel K (2005) Evaluation of natural emotions using self assessment manikins. In: IEEE workshop on automatic speech recognition and understanding, pp 381–385
Grimm M, Kroschel K, Narayanan S (2008) The Vera am Mittag German audio-visual emotional speech database. In: Proceedings of ICME, pp 865–868
Gwet KL (2008) Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 61(1):29–48
Gwet KL (2008) Intrarater reliability. In: D’Agostino RB, Sullivan L, Massaro J (eds) Wiley encyclopedia of clinical trials. Wiley, Hoboken, pp 473–485
Hayes AF, Krippendorff K (2007) Answering the call for a standard reliability measure for coding data. Commun Methods Meas 1(1):77–89
Ibáñez J (2011) Showing emotions through movement and symmetry. Comput Hum Behav 27(1):561–567
Izard CE, Libero DZ, Putnam P, Haynes OM (1993) Stability of emotion experiences and their relations to traits of personality. J Pers Soc Psychol 64(5):847–860
Krippendorff K (2007) Computing Krippendorff’s alpha reliability. University of Pennsylvania, Annenberg School for Communication, Technical report
Krippendorff K (2012) Content analysis: an introduction to its methodology, 3rd edn. SAGE Publications, Thousand Oaks
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
Lang PJ (1980) Behavioral treatment and bio-behavioral assessment: computer applications. In: Sidowski JB, Johnson JH, Williams TA (eds) Technology in mental health care delivery systems. Ablex Pub. Corp., pp 119–137
Lee CM, Narayanan S (2005) Toward detecting emotions in spoken dialogs. IEEE Trans Speech Audio Process 13(2):293–303
McDougall W (1926) An introduction to social psychology, revised edn. John W. Luce & Co, Boston
McKeown G, Valstar M, Cowie R, Pantic M (2010) The semaine corpus of emotionally coloured character interactions. In: Proceedings of ICME, pp 1079–1084
McKeown G, Valstar M, Cowie R, Pantic M, Schroder M (2012) The semaine database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans Affect Comput 3(1):5–17
Mehrabian A (1970) A semantic space for nonverbal behavior. J Consult Clin Psychol 35(2):248–257
Morris JD (1995) SAM: the self-assessment manikin an efficient cross-cultural measurement of emotional response. J Advert Res 35(6):63–68
Morris JD, McMullen JS (1994) Measuring multiple emotional responses to a single television commercial. Adv Consum Res 21:175–180
Osgood CE, Miron MS, May WH (1975) Cross-cultural universals of affective meaning. University of Illinois Press, Urbana
Plutchik R (1980) Emotion, a psychoevolutionary synthesis. Harper & Row, New York
Pugmire D (1994) Real emotion. Philos Phenomen Res 54(1):105–122
Rösner D, Friesen R, Otto M, Lange J, Haase M, Frommer J (2011) Intentionality in interacting with companion systems – an empirical approach. In: Human-Computer interaction. Towards mobile and intelligent interaction environments, LNCS, vol 6763. Springer, Berlin, Heidelberg, pp 593–602
Russell J, Mehrabian A (1974) Distinguishing anger and anxiety in terms of emotional response factors. J Consult Clin Psychol 42:79–83
Russell JA (1980) Three dimensions of emotion. J Pers Soc Psychol 39(9):1161–1178
Sacharin V, Schlegel K, Scherer KR (2012) Geneva emotion wheel rating study. Center for Person, Kommunikation, Aalborg University, NCCR Affective Sciences, Technical report
Scherer K (2005) What are emotions? and how can they be measured? Soc Sci Inform 44(4):695–729
Scherer KR (2001) Appraisal considered as a process of multilevel sequential checking, vol 92. Oxford University Press, Oxford, pp 92–120
Schimmack U (1997) The Berlin everyday language mood inventory (BELMI): toward the content valid assessment of moods. Diagnostica 43(2):150–173
Schmitt N (1996) Uses and abuses of coefficient alpha. Psychol Assess 8(4):350–353
Schröder M, Cowie R, Douglas-Cowie E, Savvidou S, McMahon E, Sawey M (2000) Feeltrace: An instrument for recording perceived emotion in real time. In: Proceedings of ITRW on speech and emotion, pp 19–24
Sharp H, Rogers Y, Preece J (2007) Interaction design: beyond human-computer interaction, 2nd edn. Wiley, London
Siegert I, Böck R, Wendemuth A (2013) The influence of context knowledge for multimodal affective annotation. In: Human-computer interaction, Part V, HCII 2013, LNCS, vol 8008. Springer, Berlin, pp 381–390
Siegert I, Böck R, Philippou-Hübner D, Vlasenko B, Wendemuth A (2011) Appropriate emotional labeling of non-acted speech using basic emotions, Geneva emotion wheel and self assessment Manikins. In: Proceedings of ICME, s.p.
Siegert I, Böck R, Wendemuth A (2012) The influence of context knowledge for multimodal annotation on natural material. In: Joint proceedings of the IVA 2012 workshops, pp 25–32
Sijtsma K (2009) On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika 74(1):107–120
Sojka P, Horak A, Kopecek I, Pala K (eds) (2012) Aggression detection in speech using sensor and semantic information, vol 7499. Springer, Berlin
Truong KP, van Leeuwen DA, de Jong FM (2012) Speech-based recognition of self-reported and observed emotion in a dimensional space. Speech Commun 54(9):1049–1063
Truong KP, Neerincx MA, van Leeuwen DA (2008) Assessing agreement of observer- and self-annotations in spontaneous multimodal emotion data. In: Proceedings of interspeech, pp 318–321
Watson D, Clark LA, Tellegen A (1988) Development and validation of brief measures of positive and negative affect: the PANAS scales. J Pers Soc Psychol 54(6):1063–1070
Wendemuth A, Biundo S (2012) A companion technology for cognitive technical systems. In: Cognitive behavioural systems, Lecture Notes in Computer Science, vol 7403, Springer, Berlin, pp 89–103
Wundt W (1922/1863) Vorlesungen über die Menschen- und Tierseele. L. Voss, Leipzig
Yang YH, Lin YC, Su YF, Chen H (2007) Music emotion classification: a regression approach. In: Proceedings of ICME, pp 208–211
Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58

Acknowledgments

This research was supported by the Transregional Collaborative Research Centre SFB/TRR 62 “A Companion-Technology for Cognitive Technical Systems” (http://www.sfb-trr-62.de) funded by the German Research Foundation (DFG). Portions of the research in this article use the Semaine Database, collected for the Semaine project (http://www.semaine-db.eu) [45].

Author information

Authors and Affiliations

IIKT and CBBS, Otto von Guericke University, 39016 Magdeburg, Germany
Ingo Siegert, Ronald Böck & Andreas Wendemuth

Corresponding author

Correspondence to Ingo Siegert.

About this article

Cite this article

Siegert, I., Böck, R. & Wendemuth, A. Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements. J Multimodal User Interfaces 8, 17–28 (2014). https://doi.org/10.1007/s12193-013-0129-9

Received: 07 April 2013
Accepted: 10 October 2013
Published: 29 November 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s12193-013-0129-9
