Abstract
Recognizing social cues and emotions is vital for navigating daily interactions: understanding emotions in conversations, interpreting body language in meetings, and supporting friends in difficult situations. This work analyzes group-level emotions in videos captured in natural settings, an early attempt at multimodal group-level emotion analysis. Automatic group emotion recognition is pivotal for understanding complex human-human interactions, yet it remains challenging because existing work predominantly focuses either on individual emotion recognition or on group emotion analysis in static images. To address this gap, we introduce a deep-learning-based multimodal fusion model that integrates diverse modalities, including audio, video, and scene. Feature extraction employs advanced models such as TimeSformer for video representation and wav2vec 2.0 for audio analysis. All experiments are conducted on the VGAF dataset. Our key findings are: (1) multimodal approaches outperform their unimodal counterparts; (2) experimental results confirm the superior performance of the proposed approach compared to benchmark methods on this dataset; and (3) there is a strong correlation between individual modalities and the emotions they convey.
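The fusion described in the abstract can be sketched as late fusion of per-modality embeddings: each stream (video, audio, scene) is pooled to a fixed-length vector, the vectors are concatenated, and a classification head predicts group affect. This is a minimal illustrative sketch, not the paper's actual architecture; the embedding sizes, the normalize-then-concatenate scheme, and the linear head are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embedding sizes (assumptions, not the paper's values):
# a TimeSformer-style video embedding, a wav2vec 2.0-style pooled audio
# embedding, and a CNN scene embedding, one vector per clip.
video_feat = rng.standard_normal(768)
audio_feat = rng.standard_normal(768)
scene_feat = rng.standard_normal(512)

def fuse(features):
    """Late fusion: L2-normalize each modality, then concatenate."""
    normed = [f / (np.linalg.norm(f) + 1e-8) for f in features]
    return np.concatenate(normed)

def classify(fused, weights, bias):
    """Linear head + softmax over 3 classes, matching VGAF's
    positive / neutral / negative group-affect labels."""
    logits = weights @ fused + bias
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

fused = fuse([video_feat, audio_feat, scene_feat])
W = rng.standard_normal((3, fused.size)) * 0.01  # untrained, for shape only
b = np.zeros(3)
probs = classify(fused, W, b)
print(fused.shape)  # (2048,) = 768 + 768 + 512
```

In practice the concatenated vector would feed a trained MLP rather than a random linear head; the sketch only shows how the three streams meet at the fusion point.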
Supported by the Ministry of Education (MoE), India, under grant number OH-3123200428.
References
Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Augusma, A., Vaufreydaz, D., Letué, F.: Multimodal group emotion recognition in-the-wild using privacy-compliant features. In: Proceedings of the 25th International Conference on Multimodal Interaction. pp. 750–754 (2023)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33, 12449–12460 (2020)
Balaji, B., Oruganti, V.R.M.: Multi-level feature fusion for group-level emotion recognition. In: Proceedings of the 19th ACM international conference on multimodal interaction. pp. 583–586 (2017)
Belova, N.S.: Group-level affect recognition in video using deviation of frame features. In: Analysis of Images, Social Networks and Texts: 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers. vol. 13217, p. 199. Springer Nature (2022)
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: Proceedings of the International Conference on Machine Learning (ICML) (July 2021)
Castro, S., Hazarika, D., Pérez-Rosas, V., Zimmermann, R., Mihalcea, R., Poria, S.: Towards multimodal sarcasm detection (an obviously perfect paper). arXiv preprint arXiv:1906.01815 (2019)
Collins, J.A., Olson, I.R.: Knowledge is power: How conceptual knowledge transforms visual cognition. Psychonomic bulletin & review 21, 843–860 (2014)
Constantin, M.G., Ştefan, L.D., Ionescu, B., Demarty, C.H., Sjöberg, M., Schedl, M., Gravier, G.: Affect in multimedia: Benchmarking violent scenes detection. IEEE Trans. Affect. Comput. 13(1), 347–366 (2020)
Dhall, A., Sharma, G., Goecke, R., Gedeon, T.: Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In: Proceedings of the 2020 International Conference on Multimodal Interaction. pp. 784–789 (2020)
Evtodienko, L.: Multimodal end-to-end group emotion recognition using cross-modal attention. arXiv preprint arXiv:2111.05890 (2021)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 6202–6211 (2019)
Ferreira, P.M., Marques, F., Cardoso, J.S., Rebelo, A.: Physiological inspired deep neural networks for emotion recognition. IEEE Access 6, 53930–53943 (2018)
Guo, X., Zhu, B., Polanía, L.F., Boncelet, C., Barner, K.E.: Group-level emotion recognition using hybrid deep models based on faces, scenes, skeletons and visual attentions. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 635–639 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
Howse, J.: OpenCV computer vision with python, vol. 27. Packt Publishing Birmingham, UK (2013)
Huang, X., Dhall, A., Goecke, R., Pietikäinen, M., Zhao, G.: Analyzing group-level emotion with global alignment kernel based approach. IEEE Trans. Affect. Comput. 13(2), 713–728 (2019)
Jin, B.T., Abdelrahman, L., Chen, C.K., Khanzada, A.: Fusical: Multimodal fusion for video sentiment. In: Proceedings of the 2020 International Conference on Multimodal Interaction. pp. 798–806 (2020)
Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLO (Jan 2023), https://github.com/ultralytics/ultralytics
Kelly, J.R., Barsade, S.G.: Mood and emotions in small groups and work teams. Organ. Behav. Hum. Decis. Process. 86(1), 99–130 (2001)
Li, S., Deng, W.: Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 13(3), 1195–1215 (2020)
Liu, C., Jiang, W., Wang, M., Tang, T.: Group level audio-video emotion recognition using hybrid networks. In: Proceedings of the 2020 International Conference on Multimodal Interaction. pp. 807–812 (2020)
Magnani, L., Civita, S., Massara, G.P.: Visual cognition and cognitive modeling. Human and machine vision: Analogies and divergencies pp. 229–243 (1994)
Morris, R.G., Tarassenko, L., Kenward, M.: Cognitive systems-Information processing meets brain science. Elsevier (2005)
Niedenthal, P.M., Brauer, M.: Social functionality of human emotion. Annu. Rev. Psychol. 63, 259–285 (2012)
O’Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., Invernizzi, L., et al.: Kerastuner. https://github.com/keras-team/keras-tuner (2019)
Pan, C., Yu, D., Sijiang, L., Zhen, G., Lei, Y.: Group emotion recognition based on multilayer hybrid network. In: 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). pp. 173–177. IEEE (2018)
Petrova, A., Vaufreydaz, D., Dessus, P.: Group-level emotion recognition using a unimodal privacy-safe non-individual approach. In: Proceedings of the 2020 International Conference on Multimodal Interaction. pp. 813–820 (2020)
Pinto, J.R., Gonçalves, T., Pinto, C., Sanhudo, L., Fonseca, J., Gonçalves, F., Carvalho, P., Cardoso, J.S.: Audiovisual classification of group emotion valence using activity recognition networks. In: 2020 IEEE 4th International Conference on Image Processing, Applications and Systems (IPAS). pp. 114–119. IEEE (2020)
Savchenko, A.V., Makarov, I.: Neural network model for video-based analysis of student’s emotions in e-learning. Optical Memory and Neural Networks 31(3), 237–244 (2022)
Sharma, G., Dhall, A., Cai, J.: Audio-visual automatic group affect analysis. IEEE Trans. Affect. Comput. 14(2), 1056–1069 (2021)
Sharma, G., Ghosh, S., Dhall, A.: Automatic group level affect and cohesion prediction in videos. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). pp. 161–167. IEEE (2019)
Tian, Y., Lu, G., Yan, Y., Zhai, G., Chen, L., Gao, Z.: A coding framework and benchmark towards low-bitrate video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Tian, Y., Lu, G., Zhai, G., Gao, Z.: Non-semantics suppressed mask learning for unsupervised video semantic compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13610–13622 (2023)
Tian, Y., Yan, Y., Zhai, G., Guo, G., Gao, Z.: Ean: event adaptive network for enhanced action recognition. Int. J. Comput. Vision 130(10), 2453–2471 (2022)
Veltmeijer, E.A., Gerritsen, C., Hindriks, K.V.: Automatic emotion recognition for groups: a review. IEEE Trans. Affect. Comput. 14(1), 89–107 (2021)
Wang, Y., Song, W., Tao, W., Liotta, A., Yang, D., Li, X., Gao, S., Sun, Y., Ge, W., Zhang, W., et al.: A systematic review on affective computing: Emotion models, databases, and recent advances. Information Fusion 83, 19–52 (2022)
Wang, Y., Wu, J., Heracleous, P., Wada, S., Kimura, R., Kurihara, S.: Implicit knowledge injectable cross attention audiovisual model for group emotion recognition. In: Proceedings of the 2020 international conference on multimodal interaction. pp. 827–834 (2020)
Xiao, F., Lee, Y.J., Grauman, K., Malik, J., Feichtenhofer, C.: Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740 (2020)
Zadeh, A., Chan, M., Liang, P.P., Tong, E., Morency, L.P.: Social-iq: A question answering benchmark for artificial social intelligence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8807–8817 (2019)
Zhang, K., Li, Y., Wang, J., Cambria, E., Li, X.: Real-time video emotion recognition based on reinforcement learning and domain knowledge. IEEE Trans. Circuits Syst. Video Technol. 32(3), 1034–1047 (2021)
Zhao, S., Ma, Y., Gu, Y., Yang, J., Xing, T., Xu, P., Hu, R., Chai, H., Keutzer, K.: An end-to-end visual-audio attention network for emotion recognition in user-generated videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 303–311 (2020)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kumar, D., Dhamdhere, P., Raman, B. (2025). Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15321. Springer, Cham. https://doi.org/10.1007/978-3-031-78305-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78304-3
Online ISBN: 978-3-031-78305-0