DOI: 10.1145/2808196.2811641

Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks

Published: 26 October 2015

Abstract

This paper presents our system design for the Audio-Visual Emotion Challenge (AV+EC 2015). Besides the baseline features, we extract from audio the functionals of low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electrocardiogram (ECG) features and 22 electrodermal activity (EDA) features from various analysis domains. The extracted features, along with the AV+EC 2015 baseline features of audio, ECG or EDA, are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), is used as the objective function. In addition, offsets between the features and the arousal/valence labels are taken into account in both feature selection and modeling of the affective dimensions. For the fusion of multimodal features, we propose a multimodal affect prediction framework based on Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks (DBLSTM-RNNs), in which the initial predictions from the single modalities, obtained via individual DBLSTM-RNNs, are first smoothed with a Gaussian filter and then fed into a second DBLSTM-RNN for the final prediction of the affective state. Experimental results show that the proposed features and the DBLSTM-RNN based fusion framework are very promising. On the development set, the CCC reaches up to 0.824 for arousal and 0.688 for valence; on the test set, the CCC is 0.747 for arousal and 0.609 for valence.
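As a minimal illustration of the CCC-based selection criterion and the feature/label offset handling described in the abstract, the NumPy sketch below computes the concordance correlation coefficient between a candidate feature contour and an arousal/valence trace, and scans a range of temporal delays to pick the one that maximizes CCC. The function names and the `max_offset` parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D sequences."""
    x_mean, y_mean = x.mean(), y.mean()
    covariance = np.mean((x - x_mean) * (y - y_mean))
    return 2.0 * covariance / (x.var() + y.var() + (x_mean - y_mean) ** 2)

def best_offset(feature, label, max_offset=100):
    """Scan delays of the label relative to the feature (annotators react
    with a lag) and return the delay, in frames, that maximizes CCC."""
    scores = []
    for d in range(max_offset + 1):
        if d == 0:
            scores.append(ccc(feature, label))
        else:
            scores.append(ccc(feature[:-d], label[d:]))
    d_best = int(np.argmax(scores))
    return d_best, scores[d_best]
```

Feature selection can then rank or search over candidate features using this CCC score (rather than the Pearson CC), with each feature evaluated at its own best delay.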

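The two-stage fusion described in the abstract can be sketched as follows. This is a compact PyTorch/SciPy illustration, not the authors' implementation: per-modality DBLSTM predictions are Gaussian-smoothed along time and then stacked as the input to a second DBLSTM that outputs the final arousal or valence contour. Layer sizes, the smoothing width `sigma`, and all class and function names are assumptions.

```python
import torch
import torch.nn as nn
from scipy.ndimage import gaussian_filter1d

class DBLSTMRegressor(nn.Module):
    """Stacked bidirectional LSTM mapping a frame-level feature sequence
    to a one-dimensional affect contour (arousal or valence)."""
    def __init__(self, input_dim, hidden_dim=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):               # x: (batch, time, input_dim)
        h, _ = self.lstm(x)             # h: (batch, time, 2 * hidden_dim)
        return self.out(h).squeeze(-1)  # (batch, time)

def fuse(single_modality_preds, fusion_model, sigma=10.0):
    """Gaussian-smooth each modality's stage-one prediction along time,
    stack the smoothed contours as input features, and let a second
    DBLSTM produce the final prediction."""
    smoothed = [torch.as_tensor(gaussian_filter1d(p.detach().cpu().numpy(),
                                                  sigma=sigma, axis=-1),
                                dtype=torch.float32)
                for p in single_modality_preds]   # each: (batch, time)
    fused_input = torch.stack(smoothed, dim=-1)   # (batch, time, n_modalities)
    return fusion_model(fused_input)

# Example wiring (dimensions are placeholders): one stage-one model per
# modality, then a fusion DBLSTM with one input channel per modality.
audio_model = DBLSTMRegressor(input_dim=100)
video_model = DBLSTMRegressor(input_dim=84)
fusion_model = DBLSTMRegressor(input_dim=2)
```

In practice the stage-one models would be trained per modality first, and their (smoothed) predictions on the development data used to train the fusion network; that training loop is omitted here.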



      Published In

      AVEC '15: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge
      October 2015
      90 pages
      ISBN:9781450337434
      DOI:10.1145/2808196
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 October 2015


      Author Tags

      1. DBLSTM-RNN
      2. audio and video features
      3. multimodal fusion
      4. offset
      5. physiological feature

      Qualifiers

      • Research-article

      Conference

      MM '15: ACM Multimedia Conference
      October 26, 2015
      Brisbane, Australia

      Acceptance Rates

      AVEC '15 paper acceptance rate: 9 of 15 submissions (60%)
      Overall acceptance rate: 52 of 98 submissions (53%)

      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 41
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 27 Feb 2025

      Citations

      Cited By

      • (2024) Multimodal Prediction of Obsessive-Compulsive Disorder and Comorbid Depression Severity and Energy Delivered by Deep Brain Electrodes. IEEE Transactions on Affective Computing, 15(4):2025-2041. DOI: 10.1109/TAFFC.2024.3395117. Online publication date: Oct 2024.
      • (2024) Toward an Interactive Reading Experience: Deep Learning Insights and Visual Narratives of Engagement and Emotion. IEEE Access, 12:6001-6016. DOI: 10.1109/ACCESS.2024.3350745. Online publication date: 2024.
      • (2023) Humor Detection System for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing. Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation, pages 35-41. DOI: 10.1145/3606039.3613107. Online publication date: 1 Nov 2023.
      • (2023) A Review of Recurrent Neural Network-Based Methods in Computational Physiology. IEEE Transactions on Neural Networks and Learning Systems, 34(10):6983-7003. DOI: 10.1109/TNNLS.2022.3145365. Online publication date: Oct 2023.
      • (2023) Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention. IEEE Transactions on Biometrics, Behavior, and Identity Science, 5(3):360-373. DOI: 10.1109/TBIOM.2022.3233083. Online publication date: Jul 2023.
      • (2023) Affect Recognition in Muscular Response Signals. IEEE Access, 11:61914-61928. DOI: 10.1109/ACCESS.2023.3279720. Online publication date: 2023.
      • (2022) A Multi-Scale Multi-Task Learning Model for Continuous Dimensional Emotion Recognition from Audio. Electronics, 11(3):417. DOI: 10.3390/electronics11030417. Online publication date: 29 Jan 2022.
      • (2022) Systematic literature review on audio-visual multimodal input in listening comprehension. Frontiers in Psychology, 13. DOI: 10.3389/fpsyg.2022.980133. Online publication date: 6 Sep 2022.
      • (2022) Applications of deep learning methods in digital biomarker research using noninvasive sensing data. DIGITAL HEALTH, 8. DOI: 10.1177/20552076221136642. Online publication date: 4 Nov 2022.
      • (2022) Privacy Preserving Personalization for Video Facial Expression Recognition Using Federated Learning. Proceedings of the 2022 International Conference on Multimodal Interaction, pages 495-503. DOI: 10.1145/3536221.3556614. Online publication date: 7 Nov 2022.
