Abstract
Automatic multimodal recognition of spontaneous emotional expressions is a largely unexplored and challenging problem. In this paper, we explore audio-visual emotion recognition in a realistic human conversation setting: the Adult Attachment Interview (AAI). Based on the assumption that facial and vocal expression convey the same coarse affective state, positive and negative emotion sequences are labeled according to the Facial Action Coding System (FACS). Facial texture in the visual channel and prosody in the audio channel are integrated in the framework of an AdaBoost multi-stream hidden Markov model (AdaMHMM), in which an AdaBoost learning scheme is used to build the component HMM fusion. Our approach is evaluated in spontaneous emotion recognition experiments on AAI data.
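The multi-stream fusion idea described above can be sketched in a few lines: each modality (facial texture, prosody) contributes a per-class HMM log-likelihood, and the streams are combined with reliability weights of the kind AdaBoost assigns to weak learners. This is a minimal illustrative sketch, not the authors' implementation; all function names, error rates, and likelihood values below are assumptions for illustration.

```python
import math

# Toy per-stream log-likelihoods for one test sequence: each dict maps an
# emotion class to log P(observations | class) under that stream's HMM.
# (Illustrative numbers; in practice these come from trained component HMMs.)
face_loglik    = {"positive": -12.0, "negative": -15.0}
prosody_loglik = {"positive": -20.0, "negative": -18.5}

def adaboost_stream_weight(error_rate):
    """Classic AdaBoost weight for a weak learner with the given training error."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def fuse_and_classify(stream_logliks, weights):
    """Weighted sum of per-stream log-likelihoods; return the best-scoring class."""
    classes = stream_logliks[0].keys()
    scores = {c: sum(w * s[c] for w, s in zip(weights, stream_logliks))
              for c in classes}
    return max(scores, key=scores.get)

# Suppose the facial stream misclassifies 20% of training sequences and the
# prosodic stream 35%; AdaBoost gives the more reliable stream a larger weight.
w_face    = adaboost_stream_weight(0.20)
w_prosody = adaboost_stream_weight(0.35)

label = fuse_and_classify([face_loglik, prosody_loglik], [w_face, w_prosody])
print(label)  # → positive
```

Here the facial stream's stronger preference for "positive" outweighs the prosodic stream's weaker preference for "negative" because the facial stream earned the larger reliability weight.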
Copyright information
© 2007 Springer Berlin Heidelberg
Cite this paper
Zeng, Z., Hu, Y., Roisman, G.I., Wen, Z., Fu, Y., Huang, T.S. (2007). Audio-Visual Spontaneous Emotion Recognition. In: Huang, T.S., Nijholt, A., Pantic, M., Pentland, A. (eds) Artificial Intelligence for Human Computing. Lecture Notes in Computer Science, vol 4451. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72348-6_4
Print ISBN: 978-3-540-72346-2
Online ISBN: 978-3-540-72348-6