HoloNet: Towards Robust Emotion Recognition in the Wild

ABSTRACT
In this paper, we present HoloNet, a carefully designed Convolutional Neural Network (CNN) architecture for our submission to the video-based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2016 challenge. In contrast to previous related methods, which usually adopt relatively simple and shallow network architectures for emotion recognition, HoloNet reflects three critical design considerations. (1) To reduce redundant filters and enhance non-saturated non-linearity in the lower convolutional layers, we use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To gain accuracy from considerably increased network depth while maintaining efficiency, we combine a residual structure with CReLU to construct the middle layers. (3) To broaden network width and introduce multi-scale feature extraction, the top layers are designed as a variant of the inception-residual structure. The main benefit of grouping these modules into HoloNet is that both the negative- and positive-phase information implicitly contained in the input data can flow through the network along multiple paths, so deep multi-scale features that explicitly capture emotion variation can be extracted from multi-path sibling layers and then concatenated for robust recognition. We obtain competitive results in this year's video-based emotion recognition sub-challenge using an ensemble of two HoloNet models trained on the provided data only. Specifically, we achieve a mean recognition rate of 57.84%, outperforming the baseline accuracy by an absolute margin of 17.37% and yielding a 4.04% absolute accuracy gain over the result of last year's winning team. Meanwhile, our method runs at several thousand frames per second on a GPU, making it well suited to real-time scenarios.
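The CReLU activation referenced in point (1) concatenates ReLU applied to the input and to its negation, so both activation phases survive the non-linearity while the channel count doubles. The following is a minimal NumPy sketch of the standard CReLU operation (Shang et al., 2016) for illustration only; it is not the authors' modified variant or implementation:

```python
import numpy as np

def crelu(x, axis=1):
    """Concatenated ReLU: concatenate ReLU(x) and ReLU(-x) along the
    channel axis, preserving both the positive and negative phases of
    the pre-activation and doubling the number of channels."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

# A toy feature map of shape (batch, channels, height, width).
x = np.array([[[[1.0, -2.0],
                [0.5, -0.5]]]])   # shape (1, 1, 2, 2)
y = crelu(x)
print(y.shape)  # (1, 2, 2, 2): channel dimension doubled
```

Because negative responses are carried by the second half of the channels instead of being zeroed out, a convolutional layer following CReLU does not need to learn paired filters of opposite phase, which is the filter-redundancy reduction the abstract alludes to.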