DOI: 10.1145/2993148.2997629
Short paper

Video emotion recognition in the wild based on fusion of multimodal features

Published: 31 October 2016

ABSTRACT

In this paper, we present our methods for the Audio-Video Based Emotion Recognition subtask of the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of the seven basic emotions for the characters in video clips extracted from movies or TV shows. In our approach, we explore multimodal features from the audio, facial image and video motion modalities. The audio features comprise statistical acoustic features, MFCC Bag-of-Audio-Words and MFCC Fisher Vectors. For the image modality, we extract hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). The improved Dense Trajectory is used as the motion feature. We train SVM, Random Forest and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher Vector is the best acoustic feature and the facial CNN feature is the most discriminative one for emotion recognition. We use late fusion to combine the different modality features and achieve 50.76% accuracy on the testing set, significantly outperforming the baseline test accuracy of 40.47%.
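As a rough illustration of the per-feature classifiers and the late-fusion step described in the abstract, the sketch below trains one classifier per feature type and combines their posterior probabilities with a weighted sum. This is a minimal hypothetical example assuming scikit-learn and random stand-in features; the feature names (mfcc_fv, face_cnn, idt), dimensions and equal fusion weights are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-modality feature matrices (n_clips x feature_dim) and labels.
# In the paper these would come from MFCC Fisher Vectors, facial CNN features,
# LBP-TOP, Dense SIFT and improved Dense Trajectories; random data stands in here.
rng = np.random.default_rng(0)
n_train, n_val = 200, 50
features = {
    "mfcc_fv":  (rng.normal(size=(n_train, 128)), rng.normal(size=(n_val, 128))),
    "face_cnn": (rng.normal(size=(n_train, 256)), rng.normal(size=(n_val, 256))),
    "idt":      (rng.normal(size=(n_train, 64)),  rng.normal(size=(n_val, 64))),
}
y_train = rng.integers(0, 7, size=n_train)  # seven basic emotion classes
y_val   = rng.integers(0, 7, size=n_val)

# One classifier per feature type (SVM shown; Random Forest or Logistic
# Regression from scikit-learn are drop-in alternatives).
classifiers = {
    name: SVC(kernel="linear", probability=True).fit(X_tr, y_train)
    for name, (X_tr, _) in features.items()
}

# Late fusion: weighted sum of per-classifier posterior probabilities.
# Equal weights are used here purely for illustration.
weights = {name: 1.0 / len(features) for name in features}
fused = sum(weights[name] * classifiers[name].predict_proba(X_va)
            for name, (_, X_va) in features.items())
pred = fused.argmax(axis=1)
print("fused validation accuracy:", (pred == y_val).mean())
```

In practice the fusion weights would be tuned on the validation set rather than fixed, and each feature type can also feed several classifier types before fusion.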


Published in
        ICMI '16: Proceedings of the 18th ACM International Conference on Multimodal Interaction
        October 2016
        605 pages
ISBN: 9781450345569
DOI: 10.1145/2993148

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States




        Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
