Abstract
Video is an important medium for communication and entertainment, so intelligent understanding of video has attracted widespread interest in the academic community. The diversity of video content and the sparseness of emotional expression make video emotion recognition challenging, especially for user-generated videos. In this paper, we propose a key frame extraction algorithm based on affective saliency estimation: by estimating the affective saliency of each video frame, key frames are extracted so that emotion-independent frames do not influence the recognition result. Efficient deep visual features are extracted with pretrained models, and both traditional classifiers, Support Vector Machine (SVM) and Random Forest (RF), and a deep model, Convolutional Neural Network (CNN), are used to perform emotion recognition. Moreover, we propose a hybrid fusion mechanism that combines score fusion with Top-K decision fusion to further improve recognition accuracy. Extensive experiments on the user-generated video datasets Ekman-6 and VideoEmotion-8 yield average recognition accuracies of 59.51% and 52.85%, respectively. The results show that the proposed method improves recognition performance and outperforms current user-generated video emotion recognition methods.
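The abstract names two concrete mechanisms: selecting key frames by affective saliency, and combining classifiers via score fusion plus Top-K decision fusion. Since the paper's exact formulation is not reproduced on this page, the following Python sketch illustrates one plausible reading only; the function names, the saliency estimator, and the tie-breaking rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def select_key_frames(saliency, num_keys):
    """Keep the indices of the num_keys frames with the highest affective
    saliency, so emotion-independent frames are dropped before recognition.
    `saliency` is a 1-D array of per-frame affective saliency estimates
    (hypothetical; the paper's estimator is not reproduced here)."""
    order = np.argsort(saliency)[::-1]       # most salient first
    return np.sort(order[:num_keys])         # restore temporal order

def hybrid_fusion(score_mats, weights=None, k=3):
    """Hedged sketch of the hybrid fusion idea: weighted score (late)
    fusion over classifiers (e.g. SVM, RF, and CNN posteriors), then a
    Top-K decision step in which the individual classifiers vote among
    the K best fused labels. Each matrix is (n_samples, n_classes)."""
    if weights is None:
        weights = [1.0 / len(score_mats)] * len(score_mats)
    fused = sum(w * s for w, s in zip(weights, score_mats))  # score fusion

    preds = []
    for i, row in enumerate(fused):
        topk = np.argsort(row)[::-1][:k]     # Top-K candidate labels
        votes = [int(np.argmax(s[i])) for s in score_mats]
        # Among the Top-K candidates, prefer the label that most
        # classifiers rank first; break ties with the fused score.
        best = max(topk, key=lambda c: (votes.count(c), row[c]))
        preds.append(int(best))
    return np.array(preds)
```

Under this reading, with three classifiers and k = 3 over the eight VideoEmotion-8 classes, a label that two of the three models rank first wins even if the averaged score narrowly favors another candidate.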
Additional information
Jie Wei and Xinyu Yang contributed equally to this work.
Cite this article
Wei, J., Yang, X. & Dong, Y. User-generated video emotion recognition based on key frames. Multimed Tools Appl 80, 14343–14361 (2021). https://doi.org/10.1007/s11042-020-10203-1