ABSTRACT
In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has attracted increasing attention because of its broad potential applications. For the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture: the face image from each frame is first fed into a fine-tuned VGG-Face network to extract a face feature, and the features of all frames are then traversed sequentially by a bidirectional RNN to capture the dynamic changes of facial texture. To capture facial actions more accurately, we propose a facial landmark trajectory model that explicitly learns the emotion-related variations of facial components. Audio signals are modeled in a CNN framework as well, by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost the performance of emotion recognition. The proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19 percentage points over the baseline.
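The abstract describes the MCEF pipeline only at a high level, so the following is a minimal PyTorch sketch of how the visual CNN-RNN branch and the final score-level fusion could be wired together. It is an illustration under stated assumptions, not the authors' implementation: per-frame VGG-Face features are assumed to be precomputed offline, an LSTM cell stands in for the bidirectional RNN mentioned in the abstract, and `FaceSequenceEncoder`, `fuse_scores`, the hidden size, and the fusion weights are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class FaceSequenceEncoder(nn.Module):
    """CNN-RNN branch (sketch): per-frame CNN features -> bidirectional RNN -> class logits."""

    def __init__(self, feat_dim=4096, hidden=128, n_classes=7):
        super().__init__()
        # feat_dim=4096 assumes fc-layer features from a fine-tuned VGG-Face network,
        # extracted in advance for every aligned face crop in the video.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)  # 7 AFEW emotion classes

    def forward(self, frame_feats):          # frame_feats: (batch, time, feat_dim)
        out, _ = self.rnn(frame_feats)       # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)             # average the sequence over time
        return self.classifier(pooled)       # (batch, n_classes)

def fuse_scores(face_p, landmark_p, audio_p, w=(0.5, 0.25, 0.25)):
    """Late fusion of the three clues as a weighted sum of class probabilities.
    The weights here are purely illustrative; the abstract does not report them."""
    return w[0] * face_p + w[1] * landmark_p + w[2] * audio_p

# Toy usage: 4 clips, 16 frames each, with precomputed 4096-d face features.
feats = torch.randn(4, 16, 4096)
face_p = FaceSequenceEncoder()(feats).softmax(dim=-1)
fused = fuse_scores(face_p, face_p, face_p)  # stand-ins for the other two branches
```

The same fusion step would accept class probabilities from the landmark trajectory model and from the audio CNN (which consumes the image-like map of stacked per-segment energy features); only the shared class dimension matters.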