DOI: 10.1145/2808196.2811638

Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks

Published: 26 October 2015

Abstract

Emotion recognition has been an active research area with wide applications and significant challenges. This paper presents our submission to the Audio/Visual Emotion Challenge (AVEC 2015), which aims to exploit audio, visual, and physiological signals to continuously predict the values of the emotion dimensions arousal and valence. Our system applies Recurrent Neural Networks (RNNs) to model temporal information. We explore several aspects to improve prediction performance, including the dominant modalities for arousal and valence prediction, the duration of feature windows, novel loss functions, the direction of Long Short-Term Memory (LSTM) processing (unidirectional vs. bidirectional), multi-task learning, and different structures for early feature fusion and late fusion. The best settings are chosen according to performance on the development set. Experimental results competitive with the challenge baseline demonstrate the effectiveness of the proposed methods.
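
To make the modeling approach concrete, below is a minimal sketch of a bidirectional LSTM that jointly regresses arousal and valence at every frame, in the multi-task spirit the abstract describes. The library (PyTorch), feature dimension, layer sizes, and the MSE training loss are illustrative assumptions, not the authors' actual configuration; in particular, the paper's novel loss functions are not reproduced here.

import torch
import torch.nn as nn

class BiLSTMEmotionRegressor(nn.Module):
    """Hypothetical bidirectional LSTM for continuous arousal/valence regression."""
    def __init__(self, feat_dim=100, hidden_dim=64, num_layers=2):
        super().__init__()
        # Bidirectional LSTM reads the per-frame feature sequence in both directions.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Multi-task head: two outputs per time step (arousal, valence), predicted jointly.
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, x):
        # x: (batch, time, feat_dim) -> predictions: (batch, time, 2)
        out, _ = self.lstm(x)
        return self.head(out)

# Toy usage: 4 clips, 300 frames each, 100-dim fused features (all assumed sizes).
model = BiLSTMEmotionRegressor()
features = torch.randn(4, 300, 100)
targets = torch.zeros(4, 300, 2)          # placeholder gold annotations
loss = nn.MSELoss()(model(features), targets)
loss.backward()

In this sketch, early fusion would correspond to concatenating per-modality features before the LSTM, while late fusion would train one such model per modality and combine the per-frame predictions afterwards.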


Published In

AVEC '15: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge
October 2015
90 pages
ISBN:9781450337434
DOI:10.1145/2808196

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. affective computing
  2. emotion recognition
  3. recurrent neural network

Qualifiers

  • Research-article

Conference

MM '15: ACM Multimedia Conference
October 26, 2015
Brisbane, Australia

Acceptance Rates

AVEC '15 Paper Acceptance Rate: 9 of 15 submissions (60%)
Overall Acceptance Rate: 52 of 98 submissions (53%)


