
Stepwise Fusion Transformer for Affective Video Content Analysis

  • Conference paper
International Conference on Neural Computing for Advanced Applications (NCAA 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1870)


Abstract

Affective video content analysis is an important part of video content analysis. This paper presents an efficient multimodal, multilevel Transformer-based model that fuses features stepwise across levels using the standard self-attention and cross-attention mechanisms. We also, for the first time, use a loss function to constrain the learning of the tokens in the Transformer, which yields good results. The model first combines the global and local features of each modality. It then uses the cross-attention module to combine information across the three modalities, and finally uses the self-attention module to integrate the information from each modality. In both classification and regression experiments, we achieve better results than previous papers. Compared with the state-of-the-art results of recent years [19], we improve the Valence and Arousal accuracies on the classification dataset by 4.267% and 0.924%, respectively. On the regression dataset, Valence improves by 0.007 and 0.08 on the MSE and PCC metrics, respectively, and Arousal correspondingly improves by 0.117 and 0.057.
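
To make the described fusion pipeline concrete, the sketch below shows one plausible arrangement of the two attention stages in PyTorch: each modality's (already combined global and local) tokens first query the other two modalities through cross-attention, and the resulting sequences are then concatenated and refined by self-attention. The class name, feature dimensions, and modality names are illustrative assumptions, not the authors' implementation, and the token-constraining loss mentioned in the abstract is omitted.

import torch
import torch.nn as nn

class StepwiseFusionBlock(nn.Module):
    """Hypothetical sketch: cross-attention fusion across modalities, then self-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # cross-attention: queries come from one modality, keys/values from the other two
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # self-attention: integrates the concatenated, already-fused token sequence
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, audio, motion):
        # each input: (batch, tokens, dim); tokens = combined global + local features
        modalities = [visual, audio, motion]
        fused = []
        for i, query in enumerate(modalities):
            keys = torch.cat([m for j, m in enumerate(modalities) if j != i], dim=1)
            out, _ = self.cross_attn(query, keys, keys)
            fused.append(self.norm1(query + out))   # residual + norm per modality
        tokens = torch.cat(fused, dim=1)            # joint sequence across modalities
        out, _ = self.self_attn(tokens, tokens, tokens)
        return self.norm2(tokens + out)

# toy usage: three modalities, batch of 2, 8 tokens each, 256-d features
block = StepwiseFusionBlock()
v, a, m = (torch.randn(2, 8, 256) for _ in range(3))
print(block(v, a, m).shape)  # torch.Size([2, 24, 256])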


References

  1. Baveye, Y., Dellandrea, E., Chamaret, C., Chen, L.: LIRIS-ACCEDE: a video database for affective content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
  2. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
  3. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
  4. Chan, C.H., Jones, G.J.: Affect-based indexing and retrieval of films. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 427–430 (2005)
  5. Chen, S., Jin, Q.: RUC at MediaEval 2016 Emotional Impact of Movies Task: fusion of multimodal features. In: MediaEval, vol. 1739 (2016)
  6. Chen, T., Wang, Y., Wang, S., Chen, S.: Exploring domain knowledge for affective video content analyses. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 769–776 (2017)
  7. Dellandréa, E., Chen, L., Baveye, Y., Sjöberg, M.V., Chamaret, C.: The MediaEval 2016 Emotional Impact of Movies Task. In: CEUR Workshop Proceedings (2016)
  8. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  9. Ekman, P.: Basic emotions. Handbook of Cognition and Emotion 98(45–60), 16 (1999)
  10. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)
  11. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
  12. Ou, Y., Chen, Z., Wu, F.: Multimodal local-global attention network for affective video content analysis. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1901–1914 (2020)
  13. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  14. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161 (1980)
  15. Sjöberg, M., et al.: The MediaEval 2015 Affective Impact of Movies Task. In: MediaEval, vol. 1436 (2015)
  16. Thao, H.T.P., Balamurali, B., Roig, G., Herremans, D.: AttendAffectNet-emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors 21(24), 8356 (2021)
  17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
  18. Wang, J., Li, B., Hu, W., Wu, O.: Horror video scene recognition via multiple-instance learning. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1325–1328. IEEE (2011)
  19. Wang, Q., Xiang, X., Zhao, J., Deng, X.: P2SL: private-shared subspaces learning for affective video content analysis. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)
  20. Wang, S., Ji, Q.: Video affective content analysis: a survey of state-of-the-art methods. IEEE Trans. Affect. Comput. 6(4), 410–430 (2015)
  21. Yi, Y., Wang, H.: Multi-modal learning for affective content analysis in movies. Multimed. Tools Appl. 78(10), 13331–13350 (2019)
  22. Yi, Y., Wang, H., Li, Q.: Affective video content analysis with adaptive fusion recurrent network. IEEE Trans. Multimed. 22(9), 2454–2466 (2019)
  23. Yi, Y., Wang, H., Tang, P.: Unified multi-stage fusion network for affective video content analysis. SSRN 4080629
  24. Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D., Levinson, S.: Audio-visual affect recognition. IEEE Trans. Multimed. 9(2), 424–428 (2007)
  25. Zhao, S., Yao, H., Sun, X., Xu, P., Liu, X., Ji, R.: Video indexing and recommendation based on affective analysis of viewers. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 1473–1476 (2011)


Acknowledgement

This work was supported in part by the Natural Science Foundation of Chongqing under Grant cstc2020jcyj-msxmX0284, and in part by the Scientific and Technological Research Program of Chongqing Municipal Education Commission under Grant KJQN202000625.

Author information

Corresponding author

Correspondence to Zeyu Chen.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chen, Z., Xiang, X., Deng, X., Wang, Q. (2023). Stepwise Fusion Transformer for Affective Video Content Analysis. In: Zhang, H., et al. International Conference on Neural Computing for Advanced Applications. NCAA 2023. Communications in Computer and Information Science, vol 1870. Springer, Singapore. https://doi.org/10.1007/978-981-99-5847-4_27


  • DOI: https://doi.org/10.1007/978-981-99-5847-4_27


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-5846-7

  • Online ISBN: 978-981-99-5847-4

  • eBook Packages: Computer Science, Computer Science (R0)
