Abstract
Affective and human-centered computing have attracted substantial attention in recent years, mainly due to the abundance of environments and applications able to exploit and adapt to multimodal input from users. Combining facial expressions with prosodic information allows us to capture a user's emotional state unobtrusively, relying on the better-performing modality when the other suffers from noise or poor sensing conditions. In this paper, we describe a multi-cue, dynamic approach to detecting emotion in naturalistic video sequences, where input is drawn from near-real-world situations rather than the controlled recording conditions typical of audiovisual material. Recognition is performed via a recurrent neural network, whose short-term memory and approximation capabilities cater for modeling dynamic events in facial and prosodic expressivity. The approach also differs from existing work in that it models user expressivity using a dimensional representation, instead of detecting discrete 'universal emotions', which are scarce in everyday human-machine interaction. The algorithm is evaluated on an audiovisual database recorded to simulate human-human discourse, which therefore contains less extreme expressivity and subtle variations across a number of emotion labels. Results show that for turns lasting more than a few frames, recognition rates rise to 98%.
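The pipeline the abstract outlines — per-frame facial and prosodic features fused and fed to a recurrent network with short-term memory, producing a dimensional (valence, activation) output — can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: all layer sizes, weights, and the `run_turn` helper are hypothetical, and a simple Elman-style recurrence stands in for whatever network topology the paper actually uses.

```python
import numpy as np

# Hypothetical feature dimensions: facial features (e.g. tracked feature-point
# distances) and prosodic features (e.g. pitch/energy statistics) per frame.
N_FACIAL, N_PROSODIC, N_HIDDEN, N_OUT = 10, 4, 8, 2

rng = np.random.default_rng(0)
# Untrained illustrative weights; in practice these would be learned.
W_in = rng.normal(scale=0.1, size=(N_HIDDEN, N_FACIAL + N_PROSODIC))
W_rec = rng.normal(scale=0.1, size=(N_HIDDEN, N_HIDDEN))  # recurrent "short-term memory"
W_out = rng.normal(scale=0.1, size=(N_OUT, N_HIDDEN))

def run_turn(facial_frames, prosodic_frames):
    """Process one speaker turn frame by frame.

    The recurrent hidden state carries context across frames, which is what
    lets the network model the *dynamics* of expressivity rather than
    classifying each frame in isolation.
    """
    h = np.zeros(N_HIDDEN)
    for f, p in zip(facial_frames, prosodic_frames):
        x = np.concatenate([f, p])         # feature-level fusion of the two cues
        h = np.tanh(W_in @ x + W_rec @ h)  # Elman-style recurrence
    # Dimensional output: (valence, activation), each squashed into [-1, 1].
    return np.tanh(W_out @ h)

# Usage: random features stand in for a 25-frame turn of real measurements.
turn = run_turn(rng.normal(size=(25, N_FACIAL)),
                rng.normal(size=(25, N_PROSODIC)))
```

Representing the output as continuous valence/activation coordinates, rather than a softmax over six basic emotions, is what allows the subtle, non-extreme states found in naturalistic discourse to be captured.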
Caridakis, G., Karpouzis, K., Wallace, M. et al. Multimodal user’s affective state analysis in naturalistic interaction. J Multimodal User Interfaces 3, 49–66 (2010). https://doi.org/10.1007/s12193-009-0030-8