DOI: 10.1145/2808196.2811641

Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks

Published: 26 October 2015

Abstract

This paper presents our system design for the Audio-Visual Emotion Challenge (AV+EC 2015). Besides the baseline features, we extract from audio the functionals of low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electrocardiogram (ECG) features and 22 electrodermal activity (EDA) features from various analysis domains. The extracted features, along with the AV+EC 2015 baseline features of audio, ECG or EDA, are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), is used as the objective function. In addition, offsets between the features and the arousal/valence labels are taken into account in both feature selection and modeling of the affective dimensions. For the fusion of multimodal features, we propose a multimodal affect prediction framework based on Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks (DBLSTM-RNNs), in which the initial predictions from the single modalities, obtained via individual DBLSTM-RNNs, are first smoothed with a Gaussian filter and then fed into a second DBLSTM-RNN for the final prediction of the affective state. Experimental results show that the proposed features and the DBLSTM-RNN based fusion framework are very promising. On the development set, the CCC reaches up to 0.824 for arousal and 0.688 for valence; on the test set, the CCC is 0.747 for arousal and 0.609 for valence.
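As a minimal illustration of the CCC-based selection criterion and the feature/label offset handling described in the abstract, the NumPy sketch below computes the concordance correlation coefficient between a candidate feature contour and an arousal/valence trace, and scans a range of temporal delays to pick the one that maximizes CCC. The function names and the `max_offset` parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D sequences."""
    x_mean, y_mean = x.mean(), y.mean()
    covariance = np.mean((x - x_mean) * (y - y_mean))
    return 2.0 * covariance / (x.var() + y.var() + (x_mean - y_mean) ** 2)

def best_offset(feature, label, max_offset=100):
    """Scan delays of the label relative to the feature (annotators react
    with a lag) and return the delay, in frames, that maximizes CCC."""
    scores = []
    for d in range(max_offset + 1):
        if d == 0:
            scores.append(ccc(feature, label))
        else:
            scores.append(ccc(feature[:-d], label[d:]))
    d_best = int(np.argmax(scores))
    return d_best, scores[d_best]
```

Feature selection can then rank or search over candidate features using this CCC score (rather than the Pearson CC), with each feature evaluated at its own best delay.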

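The two-stage fusion described in the abstract can be sketched as follows. This is a compact PyTorch/SciPy illustration, not the authors' implementation: per-modality DBLSTM predictions are Gaussian-smoothed along time and then stacked as the input to a second DBLSTM that outputs the final arousal or valence contour. Layer sizes, the smoothing width `sigma`, and all class and function names are assumptions.

```python
import torch
import torch.nn as nn
from scipy.ndimage import gaussian_filter1d

class DBLSTMRegressor(nn.Module):
    """Stacked bidirectional LSTM mapping a frame-level feature sequence
    to a one-dimensional affect contour (arousal or valence)."""
    def __init__(self, input_dim, hidden_dim=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):               # x: (batch, time, input_dim)
        h, _ = self.lstm(x)             # h: (batch, time, 2 * hidden_dim)
        return self.out(h).squeeze(-1)  # (batch, time)

def fuse(single_modality_preds, fusion_model, sigma=10.0):
    """Gaussian-smooth each modality's stage-one prediction along time,
    stack the smoothed contours as input features, and let a second
    DBLSTM produce the final prediction."""
    smoothed = [torch.as_tensor(gaussian_filter1d(p.detach().cpu().numpy(),
                                                  sigma=sigma, axis=-1),
                                dtype=torch.float32)
                for p in single_modality_preds]   # each: (batch, time)
    fused_input = torch.stack(smoothed, dim=-1)   # (batch, time, n_modalities)
    return fusion_model(fused_input)

# Example wiring (dimensions are placeholders): one stage-one model per
# modality, then a fusion DBLSTM with one input channel per modality.
audio_model = DBLSTMRegressor(input_dim=100)
video_model = DBLSTMRegressor(input_dim=84)
fusion_model = DBLSTMRegressor(input_dim=2)
```

In practice the stage-one models would be trained per modality first, and their (smoothed) predictions on the development data used to train the fusion network; that training loop is omitted here.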



      Published In

      AVEC '15: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge
      October 2015
      90 pages
      ISBN:9781450337434
      DOI:10.1145/2808196
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 October 2015


      Author Tags

      1. DBLSTM-RNN
      2. audio and video features
      3. multimodal fusion
      4. offset
      5. physiological feature

      Qualifiers

      • Research-article

      Conference

      MM '15: ACM Multimedia Conference
      October 26, 2015
      Brisbane, Australia

      Acceptance Rates

      AVEC '15 paper acceptance rate: 9 of 15 submissions (60%)
      Overall acceptance rate: 52 of 98 submissions (53%)

      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 41
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 27 Feb 2025

      Citations

      Cited By

      • (2024) Multimodal Prediction of Obsessive-Compulsive Disorder and Comorbid Depression Severity and Energy Delivered by Deep Brain Electrodes. IEEE Transactions on Affective Computing, 15(4):2025-2041. DOI: 10.1109/TAFFC.2024.3395117. Online publication date: Oct 2024.
      • (2024) Toward an Interactive Reading Experience: Deep Learning Insights and Visual Narratives of Engagement and Emotion. IEEE Access, 12:6001-6016. DOI: 10.1109/ACCESS.2024.3350745. Online publication date: 2024.
      • (2023) Humor Detection System for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing. Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation, pages 35-41. DOI: 10.1145/3606039.3613107. Online publication date: 1 Nov 2023.
      • (2023) A Review of Recurrent Neural Network-Based Methods in Computational Physiology. IEEE Transactions on Neural Networks and Learning Systems, 34(10):6983-7003. DOI: 10.1109/TNNLS.2022.3145365. Online publication date: Oct 2023.
      • (2023) Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention. IEEE Transactions on Biometrics, Behavior, and Identity Science, 5(3):360-373. DOI: 10.1109/TBIOM.2022.3233083. Online publication date: Jul 2023.
      • (2023) Affect Recognition in Muscular Response Signals. IEEE Access, 11:61914-61928. DOI: 10.1109/ACCESS.2023.3279720. Online publication date: 2023.
      • (2022) A Multi-Scale Multi-Task Learning Model for Continuous Dimensional Emotion Recognition from Audio. Electronics, 11(3):417. DOI: 10.3390/electronics11030417. Online publication date: 29 Jan 2022.
      • (2022) Systematic literature review on audio-visual multimodal input in listening comprehension. Frontiers in Psychology, 13. DOI: 10.3389/fpsyg.2022.980133. Online publication date: 6 Sep 2022.
      • (2022) Applications of deep learning methods in digital biomarker research using noninvasive sensing data. DIGITAL HEALTH, 8. DOI: 10.1177/20552076221136642. Online publication date: 4 Nov 2022.
      • (2022) Privacy Preserving Personalization for Video Facial Expression Recognition Using Federated Learning. Proceedings of the 2022 International Conference on Multimodal Interaction, pages 495-503. DOI: 10.1145/3536221.3556614. Online publication date: 7 Nov 2022.
