ABSTRACT
In this paper, we present our methods for the Audio-Video Based Emotion Recognition subtask of the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of seven basic emotions for the characters in video clips extracted from movies or TV shows. In our approach, we explore multimodal features from the audio, facial image, and video motion modalities. The audio features comprise statistical acoustic features, MFCC Bag-of-Audio-Words, and MFCC Fisher Vectors. For image-related features, we extract hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). Improved Dense Trajectories are used as the motion-related features. We train SVM, Random Forest, and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher Vector is the best acoustic feature, and the facial CNN feature is the most discriminative for emotion recognition. We use late fusion to combine the different modality features and achieve 50.76% accuracy on the test set, significantly outperforming the baseline test accuracy of 40.47%.
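Since the abstract singles out the MFCC Fisher Vector as the strongest acoustic feature, the sketch below illustrates how such an encoding can be computed. It is a minimal sketch, not the authors' exact pipeline: the MFCC settings, GMM size, file names, and the means-only gradient simplification are illustrative assumptions.

```python
# Minimal sketch: encoding per-frame MFCCs into a clip-level Fisher Vector.
# MFCC settings, GMM size, and file paths are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=13):
    """Load audio and return per-frame MFCC descriptors, shape (T, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def fisher_vector(x, gmm):
    """Fisher Vector w.r.t. the GMM means only (a common simplification)."""
    T = x.shape[0]
    post = gmm.predict_proba(x)                   # (T, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)             # diagonal std devs, (K, D)
    # Gradient of the average log-likelihood w.r.t. each component mean.
    diff = (x[:, None, :] - gmm.means_) / sigma   # (T, K, D)
    g = (post[:, :, None] * diff).sum(axis=0)     # (K, D)
    g /= T * np.sqrt(gmm.weights_)[:, None]
    fv = g.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

# Fit the GMM vocabulary on MFCCs pooled from training clips (hypothetical
# file names), then encode each clip as one fixed-length vector.
train_descs = np.vstack([mfcc_frames(p) for p in ["clip1.wav", "clip2.wav"]])
gmm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(train_descs)
clip_fv = fisher_vector(mfcc_frames("clip1.wav"), gmm)
```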
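The late-fusion step can likewise be sketched as a weighted average of per-modality class probabilities. The modality names, classifier assignments, weights, and synthetic data below are assumptions for illustration, not the paper's tuned configuration; in practice the weights would be searched on the validation split.

```python
# Minimal late-fusion sketch: one classifier per modality, fused by a
# weighted average of predicted class probabilities. Modality names and
# classifier choices are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def late_fusion_predict(train_feats, labels, test_feats, weights):
    """Train one classifier per modality and fuse class probabilities."""
    fused = 0.0
    for name, w in weights.items():
        clf = {
            "audio_fv": SVC(kernel="linear", probability=True),
            "face_cnn": LogisticRegression(max_iter=1000),
            "idt":      RandomForestClassifier(n_estimators=300),
        }[name]
        clf.fit(train_feats[name], labels)
        # Columns align across classifiers: all are trained on the same labels.
        fused = fused + w * clf.predict_proba(test_feats[name])
    return fused.argmax(axis=1)   # index of the predicted emotion class

# Toy usage with synthetic data standing in for real per-clip features.
rng = np.random.default_rng(0)
names = ("audio_fv", "face_cnn", "idt")
train = {n: rng.normal(size=(140, 16)) for n in names}
test = {n: rng.normal(size=(10, 16)) for n in names}
y = rng.integers(0, 7, size=140)  # seven emotion classes
weights = {"audio_fv": 0.3, "face_cnn": 0.5, "idt": 0.2}  # assumed values
print(late_fusion_predict(train, y, test, weights))
```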