
Stepwise Fusion Transformer for Affective Video Content Analysis

  • Conference paper
International Conference on Neural Computing for Advanced Applications (NCAA 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1870)


Abstract

Affective video content analysis is an important part of video content analysis. This paper presents an efficient multimodal, multilevel Transformer-based model that fuses features stepwise across levels using the standard self-attention and cross-attention mechanisms. We also, for the first time, use a loss function to constrain the learning of the tokens in the Transformer, which yields good results. The model first combines the global and local features of each modality. It then uses the cross-attention module to combine information across the three modalities, and finally uses the self-attention module to integrate the information from each modality. In both classification and regression experiments, we achieve better results than previous papers. Compared with the state-of-the-art results of recent years [19], we improve the Valence and Arousal accuracies on the classification dataset by 4.267% and 0.924%, respectively. On the regression dataset, Valence improves by 0.007 and 0.08 on the MSE and PCC metrics, respectively, and Arousal correspondingly improves by 0.117 and 0.057.
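
To make the described fusion pipeline concrete, the sketch below shows one plausible arrangement of the two attention stages in PyTorch: each modality's (already combined global and local) tokens first query the other two modalities through cross-attention, and the resulting sequences are then concatenated and refined by self-attention. The class name, feature dimensions, and modality names are illustrative assumptions, not the authors' implementation, and the token-constraining loss mentioned in the abstract is omitted.

import torch
import torch.nn as nn

class StepwiseFusionBlock(nn.Module):
    """Hypothetical sketch: cross-attention fusion across modalities, then self-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # cross-attention: queries come from one modality, keys/values from the other two
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # self-attention: integrates the concatenated, already-fused token sequence
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, audio, motion):
        # each input: (batch, tokens, dim); tokens = combined global + local features
        modalities = [visual, audio, motion]
        fused = []
        for i, query in enumerate(modalities):
            keys = torch.cat([m for j, m in enumerate(modalities) if j != i], dim=1)
            out, _ = self.cross_attn(query, keys, keys)
            fused.append(self.norm1(query + out))   # residual + norm per modality
        tokens = torch.cat(fused, dim=1)            # joint sequence across modalities
        out, _ = self.self_attn(tokens, tokens, tokens)
        return self.norm2(tokens + out)

# toy usage: three modalities, batch of 2, 8 tokens each, 256-d features
block = StepwiseFusionBlock()
v, a, m = (torch.randn(2, 8, 256) for _ in range(3))
print(block(v, a, m).shape)  # torch.Size([2, 24, 256])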


References

  1. Baveye, Y., Dellandrea, E., Chamaret, C., Chen, L.: LIRIS-ACCEDE: a video database for affective content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
  2. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
  3. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
  4. Chan, C.H., Jones, G.J.: Affect-based indexing and retrieval of films. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 427–430 (2005)
  5. Chen, S., Jin, Q.: RUC at MediaEval 2016 Emotional Impact of Movies Task: fusion of multimodal features. In: MediaEval, vol. 1739 (2016)
  6. Chen, T., Wang, Y., Wang, S., Chen, S.: Exploring domain knowledge for affective video content analyses. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 769–776 (2017)
  7. Dellandréa, E., Chen, L., Baveye, Y., Sjöberg, M.V., Chamaret, C.: The MediaEval 2016 Emotional Impact of Movies Task. In: CEUR Workshop Proceedings (2016)
  8. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  9. Ekman, P.: Basic emotions. Handbook of Cognition and Emotion 98(45–60), 16 (1999)
  10. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)
  11. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
  12. Ou, Y., Chen, Z., Wu, F.: Multimodal local-global attention network for affective video content analysis. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1901–1914 (2020)
  13. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  14. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161 (1980)
  15. Sjöberg, M., et al.: The MediaEval 2015 Affective Impact of Movies Task. In: MediaEval, vol. 1436 (2015)
  16. Thao, H.T.P., Balamurali, B., Roig, G., Herremans, D.: AttendAffectNet-emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors 21(24), 8356 (2021)
  17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
  18. Wang, J., Li, B., Hu, W., Wu, O.: Horror video scene recognition via multiple-instance learning. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1325–1328. IEEE (2011)
  19. Wang, Q., Xiang, X., Zhao, J., Deng, X.: P2SL: private-shared subspaces learning for affective video content analysis. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)
  20. Wang, S., Ji, Q.: Video affective content analysis: a survey of state-of-the-art methods. IEEE Trans. Affect. Comput. 6(4), 410–430 (2015)
  21. Yi, Y., Wang, H.: Multi-modal learning for affective content analysis in movies. Multimed. Tools Appl. 78(10), 13331–13350 (2019)
  22. Yi, Y., Wang, H., Li, Q.: Affective video content analysis with adaptive fusion recurrent network. IEEE Trans. Multimed. 22(9), 2454–2466 (2019)
  23. Yi, Y., Wang, H., Tang, P.: Unified multi-stage fusion network for affective video content analysis. SSRN 4080629
  24. Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D., Levinson, S.: Audio-visual affect recognition. IEEE Trans. Multimed. 9(2), 424–428 (2007)
  25. Zhao, S., Yao, H., Sun, X., Xu, P., Liu, X., Ji, R.: Video indexing and recommendation based on affective analysis of viewers. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 1473–1476 (2011)


Acknowledgement

This work was supported in part by the Natural Science Foundation of Chongqing under Grant cstc2020jcyj-msxmX0284, and in part by the Scientific and Technological Research Program of Chongqing Municipal Education Commission under Grant KJQN202000625.

Author information

Corresponding author

Correspondence to Zeyu Chen.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chen, Z., Xiang, X., Deng, X., Wang, Q. (2023). Stepwise Fusion Transformer for Affective Video Content Analysis. In: Zhang, H., et al. International Conference on Neural Computing for Advanced Applications. NCAA 2023. Communications in Computer and Information Science, vol 1870. Springer, Singapore. https://doi.org/10.1007/978-981-99-5847-4_27


  • DOI: https://doi.org/10.1007/978-981-99-5847-4_27


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-5846-7

  • Online ISBN: 978-981-99-5847-4

  • eBook Packages: Computer Science, Computer Science (R0)
