DOI: 10.1145/2993148.2997639

HoloNet: towards robust emotion recognition in the wild

Published: 31 October 2016

ABSTRACT

In this paper, we present HoloNet, a carefully designed Convolutional Neural Network (CNN) architecture underlying our submission to the video-based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2016 challenge. In contrast to previous related methods, which usually adopt relatively simple and shallow neural network architectures for the emotion recognition task, HoloNet reflects three critical considerations in network design. (1) To reduce redundant filters and enhance the non-saturated non-linearity of the lower convolutional layers, we use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To gain accuracy from considerably increased network depth while maintaining efficiency, we combine the residual structure with CReLU to construct the middle layers. (3) To broaden the network width and introduce multi-scale feature extraction, the top layers are designed as a variant of the inception-residual structure. The main benefit of grouping these modules into HoloNet is that both the negative and positive phase information implicitly contained in the input data can flow through the network along multiple paths, so deep multi-scale features that explicitly capture emotion variation can be extracted from multi-path sibling layers and then concatenated for robust recognition. We obtain competitive results in this year's video-based emotion recognition sub-challenge using an ensemble of two HoloNet models trained on the provided data only. Specifically, we obtain a mean recognition rate of 57.84%, outperforming the baseline accuracy by an absolute margin of 17.37% and yielding a 4.04% absolute accuracy gain over the result of last year's winning team. Meanwhile, our method runs at several thousand frames per second on a GPU, making it well suited to real-time scenarios.
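To make the three design points above concrete, the following is a minimal NumPy sketch of the CReLU non-linearity and of the residual and inception-residual block patterns the abstract names. It is an illustration of the general techniques, not the authors' implementation: the function names (crelu, residual_crelu_block, inception_residual_block) and the layer callables passed into them are hypothetical, and HoloNet's actual layer configuration is not reproduced here.

```python
import numpy as np

def crelu(x, axis=1):
    """Concatenated ReLU (Shang et al., 2016): apply ReLU to both the
    positive and the negated pre-activation and concatenate the two,
    so both signal phases survive and the channel count doubles."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

def residual_crelu_block(x, conv_a, conv_b):
    """Residual structure around a CReLU non-linearity:
    y = x + conv_b(crelu(conv_a(x))). Assumes conv_b restores the
    shape of x so the identity shortcut can be added."""
    return x + conv_b(crelu(conv_a(x)))

def inception_residual_block(x, branches, project):
    """Inception-residual idea: run parallel branches (e.g. different
    kernel sizes) on x, concatenate their multi-scale outputs along
    the channel axis, project back to x's shape, add the shortcut."""
    y = np.concatenate([branch(x) for branch in branches], axis=1)
    return x + project(y)

# Toy usage, with identity/slice ops standing in for real convolutions:
x = np.random.randn(1, 64, 28, 28)           # (batch, channels, H, W)
assert crelu(x).shape == (1, 128, 28, 28)    # channels doubled by CReLU
halve = lambda t: t[:, : t.shape[1] // 2]    # stand-in "projection" layer
y = residual_crelu_block(x, conv_a=lambda t: t, conv_b=halve)
z = inception_residual_block(x, branches=[lambda t: t, lambda t: t], project=halve)
assert y.shape == x.shape and z.shape == x.shape
```

Even in this toy form, the multi-path property the abstract emphasizes is visible: the negated copy inside crelu gives the negative phase of the signal its own path rather than being zeroed out by a plain ReLU, while the shortcut and branch paths of the two block patterns let information flow both around and through each stage.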


Published in

        ICMI '16: Proceedings of the 18th ACM International Conference on Multimodal Interaction
        October 2016
        605 pages
ISBN: 9781450345569
DOI: 10.1145/2993148

        Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • short-paper

        Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
