Multi-modal learning for affective content analysis in movies

Yi, Yun; Wang, Hanli

doi:10.1007/s11042-018-5662-9

Published: 30 January 2018

Volume 78, pages 13331–13350, (2019)
Multimedia Tools and Applications

Yun Yi^1,2,3 &
Hanli Wang^1,2,4

Abstract

Affective content analysis is an important research topic in video content analysis and has applications in many fields. However, designing a computational model that predicts the emotions induced by videos is challenging, because elicited emotions are relatively subjective. Intuitively, features of several modalities can depict the elicited emotions, but the correlation and influence of these features are still not well studied. To address this issue, we propose a multi-modal learning framework that classifies affective content in the valence-arousal space. In particular, we use features extracted by motion keypoint trajectory and convolutional neural networks to depict the visual modality of elicited emotions, and extract a global audio feature with the openSMILE toolkit to describe the audio modality. A linear support vector machine and support vector regression are then employed to learn the affective models. By comparing these three features with five baseline features, we find that the three features are significant for describing affective content. Experimental results also demonstrate that the three features complement each other. Moreover, the proposed framework achieves state-of-the-art results on two challenging datasets for video affective content analysis.
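The abstract does not say how the three features are combined. As a minimal sketch, assuming late fusion by a weighted average of per-modality class scores (one common choice, not necessarily the authors' method), the idea of complementary modalities can be illustrated as follows. The function names, the modality keys, the three-class layout, and all score values are hypothetical.

```python
from typing import Dict, List


def late_fusion(scores_by_modality: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-class scores across modalities.

    Assumption: each modality has its own classifier that outputs
    one score per class, and all modalities share the class order.
    """
    total = sum(weights[name] for name in scores_by_modality)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * n_classes
    for name, scores in scores_by_modality.items():
        if len(scores) != n_classes:
            raise ValueError(f"modality {name!r} has a different class count")
        for i, score in enumerate(scores):
            fused[i] += weights[name] * score / total
    return fused


def predict_class(fused: List[float]) -> int:
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three valence classes
# (negative, neutral, positive) from three modality-specific models.
example_scores = {
    "motion": [0.2, 0.5, 0.3],
    "cnn": [0.1, 0.3, 0.6],
    "audio": [0.3, 0.3, 0.4],
}
example_weights = {"motion": 1.0, "cnn": 1.0, "audio": 1.0}

fused_scores = late_fusion(example_scores, example_weights)
predicted = predict_class(fused_scores)
```

In this toy input the motion model alone would favor the neutral class, while the fused scores favor the positive class, which is the kind of effect one would look for when testing whether modalities complement each other.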

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and the Key Research and Development Project of Jiangxi Provincial Department of Science and Technology (20171BBE50065).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tongji University, Shanghai, 201804, People’s Republic of China
Yun Yi & Hanli Wang
Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, 200092, People’s Republic of China
Yun Yi & Hanli Wang
Department of Mathematics and Computer Science, Gannan Normal University, Ganzhou, 341000, People’s Republic of China
Yun Yi
Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing, Shanghai, 200092, People’s Republic of China
Hanli Wang

Corresponding author

Correspondence to Hanli Wang.

About this article

Cite this article

Yi, Y., Wang, H. Multi-modal learning for affective content analysis in movies. Multimed Tools Appl 78, 13331–13350 (2019). https://doi.org/10.1007/s11042-018-5662-9

Received: 30 July 2017
Revised: 29 December 2017
Accepted: 14 January 2018
Published: 30 January 2018
Issue Date: 30 May 2019
DOI: https://doi.org/10.1007/s11042-018-5662-9
