User-generated video emotion recognition based on key frames

Published in Multimedia Tools and Applications

Abstract

Video is an important medium for communication and entertainment, and intelligent understanding of video has therefore attracted widespread interest in the academic community. The diversity of video content and the sparsity of emotional expression make video emotion recognition challenging, especially for user-generated video. In this paper, we propose a key frame extraction algorithm based on affective saliency estimation. By estimating the affective saliency of each video frame, key frames are extracted so that emotion-independent frames do not influence the recognition result. Efficient deep visual features are extracted with pretrained models, and emotion recognition is performed with the traditional models Support Vector Machine (SVM) and Random Forest (RF) as well as the deep model Convolutional Neural Network (CNN). Moreover, we propose a hybrid fusion mechanism that combines score fusion and Top-K decision fusion to further improve recognition accuracy. Extensive experiments on the user-generated video datasets Ekman-6 and VideoEmotion-8 yield average recognition accuracies of 59.51% and 52.85%, respectively. The experimental results show that the proposed method improves recognition performance and outperforms current user-generated video emotion recognition methods.
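
As a rough illustration of the pipeline the abstract describes, the sketch below scores frames by affective saliency to select key frames, trains SVM and RF classifiers on pooled deep features, and combines their outputs with a hybrid of score fusion and Top-K decision fusion. The saliency measure, the Top-K voting rule, the value of K, and the toy data are all assumptions made for illustration; they are not the authors' actual formulation.

```python
# Minimal sketch of the abstract's pipeline. The saliency measure, the
# Top-K voting rule, K itself, and the random toy data are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# --- Key frame extraction via an assumed affective saliency score ----------
def affective_saliency(frame_features):
    # Stand-in saliency: distance of each frame's feature vector from the
    # video's mean feature, so atypical (potentially emotional) frames rank high.
    center = frame_features.mean(axis=0)
    return np.linalg.norm(frame_features - center, axis=1)

def extract_key_frames(frame_features, num_key_frames=8):
    # Keep the highest-saliency frames; emotion-independent frames are dropped.
    scores = affective_saliency(frame_features)
    keep = np.sort(np.argsort(scores)[-num_key_frames:])
    return frame_features[keep]

video = rng.normal(size=(120, 512))     # 120 frame-level deep features
key_frames = extract_key_frames(video)  # -> (8, 512)

# --- Classifiers on pooled deep visual features -----------------------------
# Toy stand-in: 200 "videos", each a 512-d pooled feature with one of six
# emotion labels (Ekman-6 has six categories).
X = rng.normal(size=(200, 512))
y = rng.integers(0, 6, size=200)
svm = SVC(probability=True).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# --- Hybrid fusion: score fusion + Top-K decision fusion --------------------
def hybrid_fusion(prob_list, k=3):
    probs = np.stack(prob_list)   # (n_models, n_samples, n_classes)
    fused = probs.mean(axis=0)    # score fusion: average class scores
    votes = np.zeros_like(fused)
    rows = np.arange(fused.shape[0])[:, None]
    for p in probs:               # each model votes for its Top-K classes
        votes[rows, np.argsort(p, axis=1)[:, -k:]] += 1
    # Fused scores are scaled below 1 so a higher vote count always wins;
    # they only break ties among classes with equal vote counts.
    return (votes + 0.99 * fused).argmax(axis=1)

preds = hybrid_fusion([svm.predict_proba(X), rf.predict_proba(X)], k=3)
print("fused accuracy on toy labels:", (preds == y).mean())
```

In the paper itself the features would come from pretrained deep networks applied to the extracted key frames, and a CNN would join the SVM and RF as a third classifier, so the fusion step would average three score vectors rather than two.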


Author information

Corresponding author

Correspondence to Xinyu Yang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jie Wei and Xinyu Yang contributed equally to this work.

About this article

Cite this article

Wei, J., Yang, X. & Dong, Y. User-generated video emotion recognition based on key frames. Multimed Tools Appl 80, 14343–14361 (2021). https://doi.org/10.1007/s11042-020-10203-1
