
Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition

Published: 23 October 2017

Abstract

Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In this paper, we present our submission to the Affect Subtask of the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous emotion prediction on three affective dimensions: Arousal, Valence, and Likability, based on audiovisual signals. We highlight three aspects of our solution: 1) we explore and fuse different hand-crafted and deep-learned features from all available modalities, including acoustic, visual, and textual modalities, and we further take interlocutor influence into account for the acoustic features; 2) we compare the effectiveness of the non-temporal model SVR and the temporal model LSTM-RNN, and show that the LSTM-RNN not only reduces feature-engineering effort, such as constructing contextual features and applying feature delay, but also improves recognition performance significantly; 3) we apply a multi-task learning strategy to predict the multiple emotion dimensions collaboratively with shared representations, exploiting the fact that the different emotion dimensions are correlated with each other. Our solution achieves CCCs of 0.675, 0.756, and 0.509 on arousal, valence, and likability, respectively, on the challenge test set, outperforming the baseline system, whose corresponding CCCs are 0.375, 0.466, and 0.246.
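The results above are reported as the Concordance Correlation Coefficient (CCC), the official AVEC metric, which penalizes both weak correlation and any systematic bias or scale mismatch between the predicted and gold affect traces. As a minimal sketch of the metric (in NumPy, for illustration only; the function and variable names below are ours, not the authors'):

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between a predicted and a gold trace."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    # Population covariance between predictions and gold annotations.
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

# CCC is 1 for perfect agreement and drops when the prediction is biased or
# rescaled, even if it stays perfectly correlated with the gold trace.
gold = np.array([0.1, 0.4, 0.35, 0.8])
print(concordance_cc(gold, gold))              # 1.0
print(concordance_cc(0.5 * gold + 0.2, gold))  # below 1.0 despite correlation 1
```

Because a shifted or rescaled prediction still scores below 1 under this definition, CCC is commonly preferred over plain Pearson correlation for continuous emotion prediction.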


Published In

AVEC '17: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge
October 2017
78 pages
ISBN: 9781450355025
DOI: 10.1145/3133944

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. dimensional emotion
  2. lstm
  3. multi-task learning
  4. multimodal features

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan

Conference

MM '17: ACM Multimedia Conference
October 23, 2017
Mountain View, California, USA

Acceptance Rates

AVEC '17 paper acceptance rate: 8 of 17 submissions, 47%
Overall acceptance rate: 52 of 98 submissions, 53%


