
Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos

  • Conference paper
  • Pattern Recognition (ICPR 2024)

Abstract

Recognizing social cues and emotions is vital for navigating daily interactions: understanding emotions in conversation, interpreting body language in meetings, and supporting friends in difficult situations. This work analyzes group-level emotions in videos captured in natural settings, an attempt at multimodal group-level emotion analysis. Automatic group emotion recognition is pivotal for understanding complex human-human interactions, yet it remains challenging because existing work predominantly focuses on either individual emotion recognition or group emotion analysis in static images. To address this gap, we introduce a deep-learning-based multimodal fusion model that integrates diverse modalities, including audio, video, and scene. Feature extraction employs advanced models such as TimeSformer for video description and wav2vec 2.0 for audio analysis. All experiments are conducted on the VGAF dataset. Our key findings are: (1) multimodal approaches outperform their unimodal counterparts; (2) experimental results confirm the superior performance of the proposed approach compared to benchmark methods on this dataset; and (3) there is a strong correlation between modalities and their respective emotions.
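The abstract describes extracting per-modality features (TimeSformer for video, wav2vec 2.0 for audio, plus scene features) and fusing them for group-emotion classification. A minimal sketch of one common fusion strategy, concatenation followed by a linear softmax head, is shown below. The embedding dimensions, random weights, and the concatenation-based fusion itself are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip embeddings (dimensions are illustrative, not from the paper):
# a TimeSformer video embedding, a wav2vec 2.0 audio embedding, and a scene embedding.
video_feat = rng.standard_normal(768)
audio_feat = rng.standard_normal(768)
scene_feat = rng.standard_normal(512)

def fuse_and_classify(feats, weights, bias):
    """Concatenation-based fusion followed by a linear softmax head."""
    fused = np.concatenate(feats)        # (768 + 768 + 512,) joint representation
    logits = weights @ fused + bias      # project to the group-emotion classes
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

n_classes = 3  # VGAF labels: positive / neutral / negative
W = rng.standard_normal((n_classes, 768 + 768 + 512)) * 0.01
b = np.zeros(n_classes)

probs = fuse_and_classify([video_feat, audio_feat, scene_feat], W, b)
```

In a trained system, `W` and `b` would be learned jointly with (or on top of) the frozen feature extractors; attention-based or multi-level fusion are common alternatives to plain concatenation.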

Supported by the Ministry of Education (MoE), INDIA, under grant number OH-3123200428.



Author information

Correspondence to Deepak Kumar.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kumar, D., Dhamdhere, P., Raman, B. (2025). Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15321. Springer, Cham. https://doi.org/10.1007/978-3-031-78305-0_26


  • DOI: https://doi.org/10.1007/978-3-031-78305-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78304-3

  • Online ISBN: 978-3-031-78305-0

  • eBook Packages: Computer Science, Computer Science (R0)
