Abstract
Automatic emotion recognition from video streams is an essential challenge for various applications, including human behavior understanding, mental disease diagnosis, surveillance, and human-machine interaction. In this paper we introduce a novel, fully automatic, multimodal emotion recognition framework based on the fusion of audio and visual information, designed to leverage the complementary nature of the two modalities while preserving the modality-specific information. Specifically, we integrate spatial, channel, and temporal attention into the visual processing pipeline and temporal self-attention into the audio branch. We then introduce a multimodal cross-attention fusion strategy that effectively exploits the relationship between the audio and video features. The experimental evaluation performed on RAVDESS, a publicly available database, validates the proposed approach with average accuracy scores above 87.85%. Compared with state-of-the-art methods, the proposed framework yields accuracy gains of more than 1.85%.
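The cross-attention fusion mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the feature dimension, number of heads, pooling choice, and classifier head are all assumptions; only the eight-class output follows from RAVDESS.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of intermodal cross-attention: each modality queries the other,
    and the two context-enriched representations are concatenated for
    classification over the 8 RAVDESS emotion categories."""

    def __init__(self, dim=256, heads=4, num_classes=8):
        super().__init__()
        # audio features attend over video features, and vice versa
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, video):
        # audio: (B, T_a, dim) frame-level audio embeddings
        # video: (B, T_v, dim) frame-level visual embeddings
        a_ctx, _ = self.audio_to_video(audio, video, video)  # audio queries video
        v_ctx, _ = self.video_to_audio(video, audio, audio)  # video queries audio
        # temporal average pooling, then concatenate the two fused streams
        fused = torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # (B, num_classes) emotion logits

model = CrossModalFusion()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 8])
```

The two attention blocks are symmetric, so each modality's representation is enriched with context from the other before fusion, which is the general idea behind the intermodal attention the paper describes.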
Acknowledgement
This work has been carried out within the framework of the joint lab AITV (Artificial Intelligence for Television) established between Télécom SudParis and France Télévisions.
This work was supported by a grant of the Ministry of Research, Innovation and Digitization, CNCS/CCCDI - UEFISCDI, project number PN-III-P1-1.1-TE-2021-0393, within PNCDI III.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mocanu, B., Tapu, R. (2022). Emotion Recognition in Video Streams Using Intramodal and Intermodal Attention Mechanisms. In: Bebis, G., et al. (eds.) Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol. 13599. Springer, Cham. https://doi.org/10.1007/978-3-031-20716-7_23
Print ISBN: 978-3-031-20715-0
Online ISBN: 978-3-031-20716-7