
Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition

Published: 23 October 2017

Abstract

Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In this paper, we present our submission to the Affect Subtask of the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous emotion prediction on three affective dimensions: Arousal, Valence, and Likability, based on audiovisual signals. We highlight three aspects of our solution: 1) we explore and fuse different hand-crafted and deep-learned features from all available modalities, including acoustic, visual, and textual modalities, and we further take interlocutor influence into account for the acoustic features; 2) we compare the effectiveness of the non-temporal model SVR and the temporal model LSTM-RNN, and show that the LSTM-RNN not only reduces feature-engineering effort, such as constructing contextual features and applying feature delay, but also improves recognition performance significantly; 3) we apply a multi-task learning strategy to predict the multiple emotion dimensions collaboratively with shared representations, exploiting the fact that the different emotion dimensions are correlated with each other. Our solution achieves CCCs of 0.675, 0.756, and 0.509 on arousal, valence, and likability, respectively, on the challenge test set, outperforming the baseline system, whose corresponding CCCs are 0.375, 0.466, and 0.246.
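The results above are reported as the Concordance Correlation Coefficient (CCC), the official AVEC metric, which penalizes both weak correlation and any systematic bias or scale mismatch between the predicted and gold affect traces. As a minimal sketch of the metric (in NumPy, for illustration only; the function and variable names below are ours, not the authors'):

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between a predicted and a gold trace."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    # Population covariance between predictions and gold annotations.
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

# CCC is 1 for perfect agreement and drops when the prediction is biased or
# rescaled, even if it stays perfectly correlated with the gold trace.
gold = np.array([0.1, 0.4, 0.35, 0.8])
print(concordance_cc(gold, gold))              # 1.0
print(concordance_cc(0.5 * gold + 0.2, gold))  # below 1.0 despite correlation 1
```

Because a shifted or rescaled prediction still scores below 1 under this definition, CCC is commonly preferred over plain Pearson correlation for continuous emotion prediction.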


Published In

AVEC '17: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge
October 2017
78 pages
ISBN: 9781450355025
DOI: 10.1145/3133944

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. dimensional emotion
  2. lstm
  3. multi-task learning
  4. multimodal features

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan

Conference

MM '17: ACM Multimedia Conference
October 23, 2017
Mountain View, California, USA

Acceptance Rates

AVEC '17 paper acceptance rate: 8 of 17 submissions, 47%
Overall acceptance rate: 52 of 98 submissions, 53%


