Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

ABSTRACT
This paper presents our efforts for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the levels of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual, and textual modalities. As the regressor we employ the long short-term memory recurrent neural network (LSTM-RNN), the state-of-the-art model for continuous emotion recognition. We augment the training data by replacing each original training sample with shorter overlapping segments extracted from it, which multiplies the number of training samples and also benefits the training of the temporal LSTM-RNN model. In addition, we explore two strategies for reducing the influence of the interlocutor, and we compare the performance of feature-level fusion and decision-level fusion. Experimental results demonstrate the effectiveness of the proposed method, which achieves competitive results.
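The augmentation scheme described above replaces each long training sequence with shorter, overlapping segments cut from it, so one session yields many frame-aligned training samples. The following is a minimal sketch of that idea in Python; the window length and hop size shown are hypothetical hyperparameters for illustration, not values reported in the paper.

```python
import numpy as np

def augment_with_overlapping_windows(features, labels, win_len, hop):
    """Cut one long (frames, dims) feature sequence into shorter
    overlapping segments, each paired with its frame-aligned label slice."""
    segments = []
    for start in range(0, len(features) - win_len + 1, hop):
        segments.append((features[start:start + win_len],
                         labels[start:start + win_len]))
    return segments

# Toy example: a 7500-frame session with 50-dim features and one
# time-continuous label track becomes 27 overlapping 1000-frame samples.
feats = np.random.randn(7500, 50)   # hypothetical frame-level features
labs = np.random.randn(7500, 1)     # hypothetical continuous labels
samples = augment_with_overlapping_windows(feats, labs, win_len=1000, hop=250)
print(len(samples))  # 27
```

Besides multiplying the number of training samples, shorter segments keep the sequences fed to the LSTM-RNN at a manageable length, which eases temporal modeling during training.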