ABSTRACT
In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has attracted increasing attention because of its broad potential applications. For the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture: the face image from each frame is first fed into a fine-tuned VGG-Face network to extract a face feature, and the features of all frames are then traversed sequentially by a bidirectional RNN to capture the dynamic changes of facial texture. To capture facial actions more accurately, we propose a facial landmark trajectory model that explicitly learns the emotion-related variations of facial components. Audio signals are modeled in a CNN framework as well, by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost the performance of emotion recognition. The proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19 percentage points over the baseline.
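The abstract describes the MCEF pipeline only at a high level, so the following is a minimal PyTorch sketch of how the visual CNN-RNN branch and the final score-level fusion could be wired together. It is an illustration under stated assumptions, not the authors' implementation: per-frame VGG-Face features are assumed to be precomputed offline, an LSTM cell stands in for the bidirectional RNN mentioned in the abstract, and `FaceSequenceEncoder`, `fuse_scores`, the hidden size, and the fusion weights are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class FaceSequenceEncoder(nn.Module):
    """CNN-RNN branch (sketch): per-frame CNN features -> bidirectional RNN -> class logits."""

    def __init__(self, feat_dim=4096, hidden=128, n_classes=7):
        super().__init__()
        # feat_dim=4096 assumes fc-layer features from a fine-tuned VGG-Face network,
        # extracted in advance for every aligned face crop in the video.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)  # 7 AFEW emotion classes

    def forward(self, frame_feats):          # frame_feats: (batch, time, feat_dim)
        out, _ = self.rnn(frame_feats)       # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)             # average the sequence over time
        return self.classifier(pooled)       # (batch, n_classes)

def fuse_scores(face_p, landmark_p, audio_p, w=(0.5, 0.25, 0.25)):
    """Late fusion of the three clues as a weighted sum of class probabilities.
    The weights here are purely illustrative; the abstract does not report them."""
    return w[0] * face_p + w[1] * landmark_p + w[2] * audio_p

# Toy usage: 4 clips, 16 frames each, with precomputed 4096-d face features.
feats = torch.randn(4, 16, 4096)
face_p = FaceSequenceEncoder()(feats).softmax(dim=-1)
fused = fuse_scores(face_p, face_p, face_p)  # stand-ins for the other two branches
```

The same fusion step would accept class probabilities from the landmark trajectory model and from the audio CNN (which consumes the image-like map of stacked per-segment energy features); only the shared class dimension matters.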