Abstract
Video is an important medium for communication and entertainment, so intelligent understanding of video has attracted widespread interest in the academic community. The diversity of video content and the sparseness of emotional expression make video emotion recognition challenging, especially for user-generated videos. In this paper, we propose a key frame extraction algorithm based on affective saliency estimation: by estimating the affective saliency of each video frame, key frames are extracted so that emotion-independent frames do not influence the recognition result. Efficient deep visual features are extracted with pretrained models, and both traditional classifiers, Support Vector Machine (SVM) and Random Forest (RF), and a deep model, Convolutional Neural Network (CNN), are used to perform emotion recognition. Moreover, we propose a hybrid fusion mechanism that combines score fusion with Top-K decision fusion to further improve recognition accuracy. Extensive experiments on the user-generated video datasets Ekman-6 and VideoEmotion-8 yield average recognition accuracies of 59.51% and 52.85%, respectively. The results show that the proposed method improves recognition performance and outperforms current user-generated video emotion recognition methods.
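The abstract names two concrete mechanisms: selecting key frames by affective saliency, and combining classifiers via score fusion plus Top-K decision fusion. Since the paper's exact formulation is not reproduced on this page, the following Python sketch illustrates one plausible reading only; the function names, the saliency estimator, and the tie-breaking rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def select_key_frames(saliency, num_keys):
    """Keep the indices of the num_keys frames with the highest affective
    saliency, so emotion-independent frames are dropped before recognition.
    `saliency` is a 1-D array of per-frame affective saliency estimates
    (hypothetical; the paper's estimator is not reproduced here)."""
    order = np.argsort(saliency)[::-1]       # most salient first
    return np.sort(order[:num_keys])         # restore temporal order

def hybrid_fusion(score_mats, weights=None, k=3):
    """Hedged sketch of the hybrid fusion idea: weighted score (late)
    fusion over classifiers (e.g. SVM, RF, and CNN posteriors), then a
    Top-K decision step in which the individual classifiers vote among
    the K best fused labels. Each matrix is (n_samples, n_classes)."""
    if weights is None:
        weights = [1.0 / len(score_mats)] * len(score_mats)
    fused = sum(w * s for w, s in zip(weights, score_mats))  # score fusion

    preds = []
    for i, row in enumerate(fused):
        topk = np.argsort(row)[::-1][:k]     # Top-K candidate labels
        votes = [int(np.argmax(s[i])) for s in score_mats]
        # Among the Top-K candidates, prefer the label that most
        # classifiers rank first; break ties with the fused score.
        best = max(topk, key=lambda c: (votes.count(c), row[c]))
        preds.append(int(best))
    return np.array(preds)
```

Under this reading, with three classifiers and k = 3 over the eight VideoEmotion-8 classes, a label that two of the three models rank first wins even if the averaged score narrowly favors another candidate.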
Additional information
Jie Wei and Xinyu Yang contributed equally to this work.
Cite this article
Wei, J., Yang, X. & Dong, Y. User-generated video emotion recognition based on key frames. Multimed Tools Appl 80, 14343–14361 (2021). https://doi.org/10.1007/s11042-020-10203-1