ABSTRACT
Emotion recognition in the wild has been an active research topic in affective computing. Although some progress has been made, the problem remains unsolved due to challenges such as head movement, face deformation, and illumination variation. To address these unconstrained conditions, we propose a bi-modality fusion method for video-based emotion recognition in the wild. The proposed framework exploits both the visual information from facial expression sequences and the speech information from audio. State-of-the-art CNN-based image recognition models are employed to improve facial expression recognition performance, and a bidirectional long short-term memory (Bi-LSTM) network captures the temporal dynamics of the learned features. Additionally, to take full advantage of the facial expression information, a VGG16 network is trained on the AffectNet dataset to learn a specialized facial expression recognition model. On the audio side, features such as low-level descriptors (LLDs) and deep features extracted from spectrogram images are also developed to improve emotion recognition performance. Our best result achieves an overall accuracy of 62.78% on the Test set of the EmotiW challenge, which outperforms the best result of EmotiW 2018 and ranked 2nd in the EmotiW 2019 challenge.
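To make the described pipeline concrete, below is a minimal sketch of the bi-modality fusion: per-frame CNN features aggregated by a Bi-LSTM for the visual branch, an utterance-level audio branch, and a late fusion of class probabilities. This is an illustration under assumptions, not the paper's exact implementation: the PyTorch framing, the class names, the feature dimensions, and the fusion weight are all hypothetical placeholders, as the abstract does not specify them.

```python
# Minimal sketch of the bi-modality fusion pipeline described above.
# Assumptions (not given in the abstract): PyTorch, 7 emotion classes,
# 512-d per-frame CNN features, and a weighted late fusion. All names
# and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

NUM_CLASSES = 7  # e.g., the seven EmotiW emotion categories


class VisualBranch(nn.Module):
    """Per-frame CNN features (e.g., from a VGG16 fine-tuned on
    AffectNet) aggregated over time by a bidirectional LSTM."""

    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, frame_feats):        # (B, T, feat_dim)
        out, _ = self.bilstm(frame_feats)  # (B, T, 2 * hidden)
        return self.head(out.mean(dim=1))  # temporal average pooling


class AudioBranch(nn.Module):
    """Utterance-level audio features, e.g., LLD statistics or CNN
    features computed from a spectrogram image."""

    def __init__(self, feat_dim=1582, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, NUM_CLASSES))

    def forward(self, audio_feats):  # (B, feat_dim)
        return self.mlp(audio_feats)


def fuse(visual_logits, audio_logits, w_visual=0.6):
    """Late fusion: weighted average of the two branches' class
    probabilities. The weight here is a placeholder, not the value
    used by the paper."""
    p_v = torch.softmax(visual_logits, dim=-1)
    p_a = torch.softmax(audio_logits, dim=-1)
    return w_visual * p_v + (1.0 - w_visual) * p_a
```

In practice such a fusion weight would be tuned on the Validation split; the abstract only states that both modalities are combined, not how the combination is parameterized.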