Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

ABSTRACT
This paper presents our efforts for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the levels of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual, and textual modalities. As the regressor we employ the long short-term memory recurrent neural network (LSTM-RNN), the state-of-the-art model for continuous emotion recognition. We augment the training data by replacing each original training sample with shorter overlapping segments extracted from it, which multiplies the number of training samples and also benefits the training of the temporal LSTM-RNN model. In addition, we explore two strategies for reducing the influence of the interlocutor, and we compare the performance of feature-level fusion and decision-level fusion. Experimental results demonstrate the effectiveness of the proposed method, which achieves competitive results.
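The augmentation scheme described above replaces each long training sequence with shorter, overlapping segments cut from it, so one session yields many frame-aligned training samples. The following is a minimal sketch of that idea in Python; the window length and hop size shown are hypothetical hyperparameters for illustration, not values reported in the paper.

```python
import numpy as np

def augment_with_overlapping_windows(features, labels, win_len, hop):
    """Cut one long (frames, dims) feature sequence into shorter
    overlapping segments, each paired with its frame-aligned label slice."""
    segments = []
    for start in range(0, len(features) - win_len + 1, hop):
        segments.append((features[start:start + win_len],
                         labels[start:start + win_len]))
    return segments

# Toy example: a 7500-frame session with 50-dim features and one
# time-continuous label track becomes 27 overlapping 1000-frame samples.
feats = np.random.randn(7500, 50)   # hypothetical frame-level features
labs = np.random.randn(7500, 1)     # hypothetical continuous labels
samples = augment_with_overlapping_windows(feats, labs, win_len=1000, hop=250)
print(len(samples))  # 27
```

Besides multiplying the number of training samples, shorter segments keep the sequences fed to the LSTM-RNN at a manageable length, which eases temporal modeling during training.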