Abstract
Affective video content analysis is an important branch of video content analysis. This paper presents an efficient multimodal, multilevel Transformer-derived model that fuses features stepwise across levels using standard self-attention and cross-attention mechanisms. We also, for the first time, apply a loss function to constrain the learning of tokens in the Transformer, which yields good results. The model first combines the global and local features of each modality, then uses a cross-attention module to combine information across the three modalities, and finally uses a self-attention module to integrate the features of each modality. In both classification and regression experiments, our model outperforms previous work. Compared to recent state-of-the-art results [19], we improve Valence and Arousal accuracy on the classification dataset by 4.267% and 0.924%, respectively. On the regression dataset, Valence results improve by 0.007 in MSE and 0.08 in PCC; Arousal improves correspondingly by 0.117 and 0.057.
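To make the fusion pipeline concrete, below is a minimal PyTorch sketch of the three steps the abstract describes: per-modality global/local feature combination, cross-modal exchange via cross-attention, and integration via self-attention. This is an illustration under our own assumptions, not the authors' released implementation: the class names (CrossAttentionBlock, StepwiseFusion), the example modality streams, the token shapes, and the use of nn.MultiheadAttention / nn.TransformerEncoderLayer are all hypothetical, and the token-constraining loss mentioned above is not reproduced here.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """One modality's tokens (query) attend to another modality's tokens."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query, context, context)  # query attends to context
        return self.norm(query + fused)                # residual + layer norm


class StepwiseFusion(nn.Module):
    """Hypothetical stepwise fusion over three modalities:
    (1) each input already concatenates that modality's global and local features,
    (2) cross-attention exchanges information across modalities,
    (3) self-attention integrates the merged token sequence."""

    def __init__(self, dim: int, num_heads: int = 4, num_modalities: int = 3):
        super().__init__()
        self.cross = nn.ModuleList(
            CrossAttentionBlock(dim, num_heads) for _ in range(num_modalities)
        )
        self.integrate = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, mods: list[torch.Tensor]) -> torch.Tensor:
        # mods: one (batch, tokens, dim) tensor per modality (step 1 done upstream).
        fused = []
        for i, block in enumerate(self.cross):
            # Step 2: modality i queries the concatenated tokens of the others.
            others = torch.cat([m for j, m in enumerate(mods) if j != i], dim=1)
            fused.append(block(mods[i], others))
        # Step 3: self-attention over all fused tokens integrates the modalities.
        return self.integrate(torch.cat(fused, dim=1))


# Example usage with three dummy modality streams (e.g., visual, audio, motion):
streams = [torch.randn(2, 16, 256) for _ in range(3)]
out = StepwiseFusion(dim=256)(streams)  # -> shape (2, 48, 256)
```

Running cross-attention before self-attention mirrors the stepwise order stated in the abstract: each modality first borrows complementary cues from the other two, and only then is the merged sequence integrated jointly.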
References
Baveye, Y., Dellandrea, E., Chamaret, C., Chen, L.: LIRIS-accede: a video database for affective content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Chan, C.H., Jones, G.J.: Affect-based indexing and retrieval of films. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 427–430 (2005)
Chen, S., Jin, Q.: RUC at MediaEval 2016 emotional impact of movies task: fusion of multimodal features. In: MediaEval, vol. 1739 (2016)
Chen, T., Wang, Y., Wang, S., Chen, S.: Exploring domain knowledge for affective video content analyses. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 769–776 (2017)
Dellandréa, E., Chen, L., Baveye, Y., Sjöberg, M.V., Chamaret, C.: The MediaEval 2016 emotional impact of movies task. In: CEUR Workshop Proceedings (2016)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Ekman, P.: Basic emotions. Handbook of Cognition and Emotion 98(45–60), 16 (1999)
Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Ou, Y., Chen, Z., Wu, F.: Multimodal local-global attention network for affective video content analysis. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1901–1914 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161 (1980)
Sjöberg, M., et al.: The MediaEval 2015 affective impact of movies task. In: MediaEval, vol. 1436 (2015)
Thao, H.T.P., Balamurali, B., Roig, G., Herremans, D.: Attendaffectnet-emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors 21(24), 8356 (2021)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Wang, J., Li, B., Hu, W., Wu, O.: Horror video scene recognition via multiple-instance learning. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1325–1328. IEEE (2011)
Wang, Q., Xiang, X., Zhao, J., Deng, X.: P2SL: private-shared subspaces learning for affective video content analysis. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)
Wang, S., Ji, Q.: Video affective content analysis: a survey of state-of-the-art methods. IEEE Trans. Affect. Comput. 6(4), 410–430 (2015)
Yi, Y., Wang, H.: Multi-modal learning for affective content analysis in movies. Multimed. Tools Appl. 78(10), 13331–13350 (2019)
Yi, Y., Wang, H., Li, Q.: Affective video content analysis with adaptive fusion recurrent network. IEEE Trans. Multimed. 22(9), 2454–2466 (2019)
Yi, Y., Wang, H., Tang, P.: Unified multi-stage fusion network for affective video content analysis. SSRN 4080629
Zeng, Z., et al.: Audio-visual affect recognition. IEEE Trans. Multimed. 9(2), 424–428 (2007)
Zhao, S., Yao, H., Sun, X., Xu, P., Liu, X., Ji, R.: Video indexing and recommendation based on affective analysis of viewers. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 1473–1476 (2011)
Acknowledgement
This work was supported in part by the Natural Science Foundation of Chongqing under Grant cstc2020jcyj-msxmX0284; in part by the Scientific and Technological Research Program of Chongqing Municipal Education Commission under Grant KJQN202000625.