Abstract
Recognition of emotions from multimodal cues is of fundamental interest for the design of adaptive interfaces in human-machine interaction (HMI) in general and human-robot interaction (HRI) in particular, as it provides a means to incorporate non-verbal feedback into the course of interaction. Humans express their emotional and affective state largely unconsciously through their natural communication modalities, such as body language, facial expression and prosodic intonation. To achieve applicability in realistic HRI settings, we develop person-independent affective models. In this paper, we present a study on multimodal recognition of emotions from such auditory and visual cues for interaction interfaces. We recognize six basic emotion classes plus a neutral class for talking persons, focusing on the simultaneous online visual and acoustic analysis of speaking faces. A probabilistic decision-level fusion scheme based on Bayesian networks is applied to exploit the complementary information of the acoustic and the visual cues. We compare the performance of our state-of-the-art recognition systems for the separate modalities with the improved results obtained after applying our fusion scheme, on both the DaFEx database and real-life data captured directly from the robot. We furthermore discuss the results with regard to the theoretical background and future applications.
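The abstract describes the fusion scheme only at a high level; the concrete Bayesian network structure is not specified here. As an illustration of what decision-level fusion over the seven emotion classes can look like, the sketch below combines the posteriors of an acoustic and a visual classifier under a conditional-independence assumption, i.e. P(e | a, v) ∝ P(e | a) · P(e | v) / P(e). All class names, probabilities and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Emotion classes: six basic emotions (after Ekman) plus neutral.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def fuse_posteriors(p_audio, p_video, prior=None):
    """Decision-level fusion of two per-modality posteriors.

    Assuming the acoustic and visual observations are conditionally
    independent given the emotion class, Bayes' rule yields
        P(e | a, v)  ∝  P(e | a) * P(e | v) / P(e).
    This is a simplified stand-in for the paper's Bayesian network.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    if prior is None:
        prior = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))  # uniform class prior
    fused = p_audio * p_video / prior
    return fused / fused.sum()  # renormalise to a proper distribution

# Hypothetical per-modality outputs for one utterance of a speaking face:
p_audio = [0.05, 0.05, 0.10, 0.55, 0.05, 0.10, 0.10]  # prosody favours happiness
p_video = [0.10, 0.05, 0.05, 0.45, 0.10, 0.15, 0.10]  # facial analysis agrees

p_fused = fuse_posteriors(p_audio, p_video)
print(EMOTIONS[int(np.argmax(p_fused))])  # -> "happiness"
```

Because the two modalities carry complementary information, agreement between them sharpens the fused posterior, while disagreement attenuates overconfident single-modality decisions.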
References
Battocchi, A., Pianesi, F., Goren-Bar, D.: A first evaluation study of a database of kinetic facial expressions (DaFEx). In: Proc. Int. Conf. Multimodal Interfaces, pp. 214–221. ACM Press, New York (2005)
Ekman, P., Friesen, W.: Unmasking the Face: A Guide to Recognizing Emotions from Facial Expressions. Prentice Hall, Englewood Cliffs (1975)
Paleari, M., Lisetti, C.L.: Toward multimodal fusion of affective cues. In: Proc. ACM Int. Workshop on Human-Centered Multimedia, pp. 99–108. ACM, New York (2006)
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U., Narayanan, S.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proc. Int. Conf. Multimodal Interfaces (2004)
Caridakis, G., Malatesta, L., Kessous, L., Amir, N., Raouzaiou, A., Karpouzis, K.: Modeling naturalistic affective states via facial and vocal expressions recognition. In: Proc. Int. Conf. Multimodal Interfaces, pp. 146–154. ACM, New York (2006)
Zeng, Z., Hu, Y., Fu, Y., Huang, T.S., Roisman, G.I., Wen, Z.: Audio-visual emotion recognition in adult attachment interview. In: Proc. Int. Conf. on Multimodal Interfaces, pp. 139–145. ACM, New York (2006)
Massaro, D.W., Egan, P.B.: Perceiving affect from the voice and the face. Psychonomic Bulletin and Review 3, 215–221 (1996)
de Gelder, B., Vroomen, J.: Bimodal emotion perception: integration across separate modalities, cross-modal perceptual grouping or perception of multimodal events? Cognition and Emotion 14, 321–324 (2000)
Schwartz, J.L.: Why the FLMP should not be applied to McGurk data, or how to better compare models in the Bayesian framework. In: Proc. Int. Conf. Audio-Visual Speech Processing, pp. 77–82 (2003)
Fagel, S.: Emotional McGurk effect. In: Proc. Int. Conf. on Speech Prosody, Dresden, Germany (2006)
Rabie, A., Lang, C., Hanheide, M., Castrillon-Santana, M., Sagerer, G.: Automatic initialization for facial analysis in interactive robotics (2008)
Hegel, F., Spexard, T., Vogt, T., Horstmann, G., Wrede, B.: Playing a different imitation game: Interaction with an empathic android robot. In: Proc. Int. Conf. Humanoid Robots, pp. 56–61 (2006)
Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. PAMI 23, 681–685 (2001)
Castrillón, M., Déniz, O., Guerra, C., Hernández, M.: Encara2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation 18, 130–140 (2007)
Hanheide, M., Wrede, S., Lang, C., Sagerer, G.: Who am I talking with? A face memory for social robots (2008)
Vogt, T., André, E., Bee, N.: EmoVoice — A framework for online recognition of emotions from voice. In: Proc. Workshop on Perception and Interactive Technologies for Speech-Based Systems, Irsee, Germany (2008)
Hall, M.A.: Correlation-based feature subset selection for machine learning. Master’s thesis, University of Waikato, New Zealand (1998)
Vogt, T., André, E.: Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition. In: Proc. of IEEE Int. Conf. on Multimedia & Expo., Amsterdam, The Netherlands (2005)
Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 39–58 (2009)
Rabie, A., Vogt, T., Hanheide, M., Wrede, B.: Evaluation and discussion of multi-modal emotion recognition. In: ICCEE (2009)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Rabie, A., Handmann, U. (2011). Fusion of Audio- and Visual Cues for Real-Life Emotional Human Robot Interaction. In: Mester, R., Felsberg, M. (eds) Pattern Recognition. DAGM 2011. Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23123-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23122-3
Online ISBN: 978-3-642-23123-0