DOI: 10.1145/2993148.2997630 · ICMI '16 short paper

Multi-clue fusion for emotion recognition in the wild

Published: 31 October 2016

ABSTRACT

In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has attracted increasing attention due to its broad potential applications. For the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture: the face image from each frame is first fed into a fine-tuned VGG-Face network to extract a face feature, and the features of all frames are then traversed sequentially by a bidirectional RNN to capture the dynamic changes of facial textures. To capture facial actions more accurately, we propose a facial landmark trajectory model that explicitly learns the emotion-driven variations of facial components. Audio signals are also modeled in a CNN framework by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost the performance of emotion recognition. The proposed MCEF achieves an overall accuracy of 56.66%, an improvement of 16.19 percentage points over the baseline.
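
The texture clue and the final fusion step lend themselves to a short sketch. Below is a minimal PyTorch illustration, not the authors' released implementation: per-frame features (stand-ins for fine-tuned VGG-Face activations) pass through a bidirectional LSTM, and the per-branch scores are fused by a weighted average. The feature dimension, hidden size, temporal pooling, and fusion weights are all illustrative assumptions.

```python
# Minimal sketch (PyTorch), assuming fc-layer-sized VGG-Face features and
# score-level fusion; NOT the authors' code or tuned settings.
import torch
import torch.nn as nn

NUM_CLASSES = 7    # the seven EmotiW emotion categories
FEAT_DIM = 4096    # assumed size of a VGG-Face fully-connected feature
HIDDEN = 512       # assumed RNN hidden size

class CnnRnnBranch(nn.Module):
    """Bidirectional RNN over a sequence of per-frame face features."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(2 * HIDDEN, NUM_CLASSES)

    def forward(self, frame_feats):          # (batch, time, FEAT_DIM)
        out, _ = self.rnn(frame_feats)       # (batch, time, 2 * HIDDEN)
        return self.fc(out.mean(dim=1))      # temporal average pooling (assumed)

def fuse_scores(texture, landmark, audio, weights=(0.5, 0.25, 0.25)):
    """Late fusion: weighted average of per-branch softmax scores.
    The weights are placeholders, not the paper's values."""
    probs = [torch.softmax(s, dim=-1) for s in (texture, landmark, audio)]
    return sum(w * p for w, p in zip(weights, probs))

if __name__ == "__main__":
    feats = torch.randn(2, 16, FEAT_DIM)             # 2 clips, 16 frames each
    texture_scores = CnnRnnBranch()(feats)
    landmark_scores = torch.randn(2, NUM_CLASSES)    # stand-in branch outputs
    audio_scores = torch.randn(2, NUM_CLASSES)
    fused = fuse_scores(texture_scores, landmark_scores, audio_scores)
    print(fused.argmax(dim=-1))                      # predicted class per clip
```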
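
For the audio clue, a rough NumPy sketch of the "stack energy features into an image-like map" step follows. The segment count, frames per segment, and log-energy measure are our assumptions rather than the paper's exact recipe; the point is only the shape of the transformation from a waveform to a 2D map a CNN can consume.

```python
# Sketch of building an image-like map from segmented audio; all sizes
# and the log-energy measure are assumptions, not the paper's recipe.
import numpy as np

def energy_map(signal, num_segments=16, frames_per_segment=64):
    """Split a waveform into segments, compute per-frame log energy,
    and stack the per-segment energy vectors into a 2D map."""
    seg_len = len(signal) // num_segments
    rows = []
    for s in range(num_segments):
        seg = signal[s * seg_len:(s + 1) * seg_len]
        frame_len = max(1, len(seg) // frames_per_segment)
        frames = seg[:frame_len * frames_per_segment].reshape(
            frames_per_segment, frame_len)
        rows.append(np.log1p((frames ** 2).sum(axis=1)))   # log energy per frame
    return np.stack(rows)        # shape: (num_segments, frames_per_segment)

signal = np.random.randn(16000 * 4)   # 4 s of synthetic 16 kHz audio
feat_map = energy_map(signal)
print(feat_map.shape)                 # (16, 64): a single-channel "image"
```

The resulting map can then be fed to a standard 2D CNN classifier in place of a grayscale image.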


Published in

ICMI '16: Proceedings of the 18th ACM International Conference on Multimodal Interaction, October 2016, 605 pages. ISBN 9781450345569. DOI: 10.1145/2993148. Copyright © 2016 ACM.


Publisher: Association for Computing Machinery, New York, NY, United States
