Research Article
DOI: 10.1145/3340555.3355719

Bi-modality Fusion for Emotion Recognition in the Wild

Published: 14 October 2019

ABSTRACT

Emotion recognition in the wild has become a hot research topic in affective computing. Although some progress has been made, it remains an unsolved problem because of challenges such as head movement, face deformation, and illumination variation. To handle these unconstrained conditions, we propose a bi-modality fusion method for video-based emotion recognition in the wild. The proposed framework exploits both the visual information in facial expression sequences and the speech information in the accompanying audio. State-of-the-art CNN-based object recognition models are employed to strengthen facial expression recognition, and a bidirectional long short-term memory (Bi-LSTM) network captures the dynamics of the learned frame-level features. In addition, to take full advantage of facial expression information, a VGG16 network is trained on the AffectNet dataset to obtain a specialized facial expression recognition model. On the audio side, features such as low-level descriptors (LLDs) and deep features extracted from spectrogram images are developed to further improve recognition performance. Our best result achieves an overall accuracy of 62.78% on the Test set of the EmotiW challenge, which outperforms the best result of EmotiW 2018 and ranked 2nd in the EmotiW 2019 challenge.
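The kind of pipeline the abstract describes (per-frame CNN features aggregated over time by a Bi-LSTM, then fused at score level with an audio branch) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimension, hidden size, the 7-class output, and the fusion weights are assumptions made only for demonstration.

```python
# Illustrative sketch (PyTorch), not the authors' released code: a Bi-LSTM over
# per-frame CNN features for the visual branch, plus a simple weighted
# score-level fusion with an audio branch. Dimensions and weights are assumed.
import torch
import torch.nn as nn


class VisualBiLSTM(nn.Module):
    """Aggregates a sequence of per-frame CNN features (e.g. from a VGG16
    facial-expression model) into emotion-class logits with a Bi-LSTM."""

    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=7):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frame_feats):            # (batch, seq_len, feat_dim)
        out, _ = self.bilstm(frame_feats)      # (batch, seq_len, 2*hidden_dim)
        clip_feat = out.mean(dim=1)            # average hidden states over time
        return self.classifier(clip_feat)      # (batch, num_classes) logits


def fuse_scores(visual_logits, audio_logits, w_visual=0.6, w_audio=0.4):
    """Late fusion: weighted sum of per-modality class probabilities.
    The 0.6/0.4 weights are placeholders, not the paper's tuned values."""
    p_visual = torch.softmax(visual_logits, dim=-1)
    p_audio = torch.softmax(audio_logits, dim=-1)
    return w_visual * p_visual + w_audio * p_audio


if __name__ == "__main__":
    model = VisualBiLSTM()
    frames = torch.randn(2, 16, 4096)          # 2 clips, 16 frames of CNN features
    audio_logits = torch.randn(2, 7)           # stand-in for a separate audio model
    fused = fuse_scores(model(frames), audio_logits)
    print(fused.argmax(dim=-1))                # predicted emotion indices per clip
```

In practice the visual and audio branches would be trained separately and the fusion weights selected on the validation set; the sketch only shows how the pieces fit together.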

References

  1. Abhinav Dhall, Roland Goecke, Shreya Ghosh, and Tom Gedeon. 2019. EmotiW 2019: Automatic Emotion, Engagement and Cohesion Prediction Tasks. In Proceedings of the 21st ACM International Conference on Multimodal Interaction. ACM.
  2. Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. 2012. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. IEEE MultiMedia 19, 3 (July 2012), 34–41.
  3. Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610 (IJCNN 2005).
  4. Da Guo, Kai Wang, Jianfei Yang, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. 2019. Exploring Regularizations with Face, Body and Image Cues for Group Cohesion Prediction. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (in press). ACM.
  5. K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  6. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
  7. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261–2269. https://doi.org/10.1109/CVPR.2017.243
  8. Heysem Kaya, Furkan Gürpinar, Sadaf Afshar, and Albert Ali Salah. 2015. Contrasting and Combining Least Squares Based Learners for Emotion Recognition in the Wild. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI '15). ACM, New York, NY, USA, 459–466.
  9. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 60, 6 (May 2017), 84–90.
  10. Markus Kächele, Martin Schels, Sascha Meudt, Günther Palm, and Friedhelm Schwenker. 2016. Revisiting the EmotiW challenge: how wild is it really? Journal on Multimodal User Interfaces 10, 2 (2016), 1–12.
  11. Chuanhe Liu, Tianhao Tang, Kui Lv, and Minghao Wang. 2018. Multi-Feature Based Emotion Recognition for Video Clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI '18). ACM, New York, NY, USA, 630–634.
  12. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 2010.
  13. A. Mollahosseini, B. Hasani, and M. H. Mahoor. 2019. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Transactions on Affective Computing 10, 1 (Jan 2019), 18–31.
  14. Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. In British Machine Vision Conference (BMVC).
  15. R. V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid. 1995. Speech recognition with primarily temporal cues. Science 270, 5234 (1995), 303–304.
  16. Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
  17. C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going Deeper with Convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–9.
  18. Kai Wang, Jianfei Yang, Da Guo, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. 2019. Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (in press). ACM.
  19. Jianlong Wu, Zhouchen Lin, and Hongbin Zha. 2015. Multiple Models Fusion for Emotion Recognition in the Wild. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI '15). ACM, New York, NY, USA, 475–481.
  20. Shuzhe Wu, Meina Kan, Zhenliang He, Shiguang Shan, and Xilin Chen. 2017. Funnel-structured cascade for multi-view face detection with alignment-awareness. Neurocomputing 221 (2017), 138–145.
  21. Anbang Yao, Junchao Shao, Ningning Ma, and Yurong Chen. 2015. Capturing AU-aware facial features and their latent relations for emotion recognition in the wild. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI '15).
  22. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (2014), 3320–3328.
  23. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct 2016), 1499–1503.

Published in
ICMI '19: 2019 International Conference on Multimodal Interaction, October 2019, 601 pages
ISBN: 9781450368605
DOI: 10.1145/3340555
Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

