HoloNet: Towards Robust Emotion Recognition in the Wild

ABSTRACT
In this paper, we present HoloNet, a carefully designed Convolutional Neural Network (CNN) architecture for our submission to the video-based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2016 challenge. In contrast to previous related methods, which usually adopt relatively simple and shallow network architectures for emotion recognition, HoloNet reflects three critical design considerations. (1) To reduce redundant filters and enhance non-saturated non-linearity in the lower convolutional layers, we use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To gain accuracy from considerably increased network depth while maintaining efficiency, we combine a residual structure with CReLU to construct the middle layers. (3) To broaden network width and introduce multi-scale feature extraction, the top layers are designed as a variant of the inception-residual structure. The main benefit of grouping these modules into HoloNet is that both the negative- and positive-phase information implicitly contained in the input data can flow through the network along multiple paths, so deep multi-scale features that explicitly capture emotion variation can be extracted from multi-path sibling layers and then concatenated for robust recognition. We obtain competitive results in this year's video-based emotion recognition sub-challenge using an ensemble of two HoloNet models trained on the provided data only. Specifically, we achieve a mean recognition rate of 57.84%, outperforming the baseline accuracy by an absolute margin of 17.37% and yielding a 4.04% absolute accuracy gain over the result of last year's winning team. Meanwhile, our method runs at several thousand frames per second on a GPU, making it well suited to real-time scenarios.
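The CReLU activation referenced in point (1) concatenates ReLU applied to the input and to its negation, so both activation phases survive the non-linearity while the channel count doubles. The following is a minimal NumPy sketch of the standard CReLU operation (Shang et al., 2016) for illustration only; it is not the authors' modified variant or implementation:

```python
import numpy as np

def crelu(x, axis=1):
    """Concatenated ReLU: concatenate ReLU(x) and ReLU(-x) along the
    channel axis, preserving both the positive and negative phases of
    the pre-activation and doubling the number of channels."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

# A toy feature map of shape (batch, channels, height, width).
x = np.array([[[[1.0, -2.0],
                [0.5, -0.5]]]])   # shape (1, 1, 2, 2)
y = crelu(x)
print(y.shape)  # (1, 2, 2, 2): channel dimension doubled
```

Because negative responses are carried by the second half of the channels instead of being zeroed out, a convolutional layer following CReLU does not need to learn paired filters of opposite phase, which is the filter-redundancy reduction the abstract alludes to.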