ABSTRACT
Emotion recognition in the wild has been an active research topic in affective computing. Although some progress has been made, the problem remains unsolved due to challenges such as head movement, face deformation, and illumination variation. To address these unconstrained conditions, we propose a bi-modality fusion method for video-based emotion recognition in the wild. The proposed framework exploits both the visual information from facial expression sequences and the speech information from audio. State-of-the-art CNN-based image recognition models are employed to improve facial expression recognition performance, and a bidirectional long short-term memory (Bi-LSTM) network captures the temporal dynamics of the learned features. Additionally, to take full advantage of the facial expression information, a VGG16 network is trained on the AffectNet dataset to learn a specialized facial expression recognition model. On the audio side, features such as low-level descriptors (LLDs) and deep features extracted from spectrogram images are also developed to improve emotion recognition performance. Our best result achieves an overall accuracy of 62.78% on the Test set of the EmotiW challenge, which outperforms the best result of EmotiW 2018 and ranked 2nd in the EmotiW 2019 challenge.
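To make the described pipeline concrete, below is a minimal sketch of the bi-modality fusion: per-frame CNN features aggregated by a Bi-LSTM for the visual branch, an utterance-level audio branch, and a late fusion of class probabilities. This is an illustration under assumptions, not the paper's exact implementation: the PyTorch framing, the class names, the feature dimensions, and the fusion weight are all hypothetical placeholders, as the abstract does not specify them.

```python
# Minimal sketch of the bi-modality fusion pipeline described above.
# Assumptions (not given in the abstract): PyTorch, 7 emotion classes,
# 512-d per-frame CNN features, and a weighted late fusion. All names
# and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

NUM_CLASSES = 7  # e.g., the seven EmotiW emotion categories


class VisualBranch(nn.Module):
    """Per-frame CNN features (e.g., from a VGG16 fine-tuned on
    AffectNet) aggregated over time by a bidirectional LSTM."""

    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, frame_feats):        # (B, T, feat_dim)
        out, _ = self.bilstm(frame_feats)  # (B, T, 2 * hidden)
        return self.head(out.mean(dim=1))  # temporal average pooling


class AudioBranch(nn.Module):
    """Utterance-level audio features, e.g., LLD statistics or CNN
    features computed from a spectrogram image."""

    def __init__(self, feat_dim=1582, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, NUM_CLASSES))

    def forward(self, audio_feats):  # (B, feat_dim)
        return self.mlp(audio_feats)


def fuse(visual_logits, audio_logits, w_visual=0.6):
    """Late fusion: weighted average of the two branches' class
    probabilities. The weight here is a placeholder, not the value
    used by the paper."""
    p_v = torch.softmax(visual_logits, dim=-1)
    p_a = torch.softmax(audio_logits, dim=-1)
    return w_visual * p_v + (1.0 - w_visual) * p_a
```

In practice such a fusion weight would be tuned on the Validation split; the abstract only states that both modalities are combined, not how the combination is parameterized.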