Abstract
In this work, we explore the effectiveness of multimodal models for estimating the emotional state expressed continuously in the Valence/Arousal space. We consider four modalities typically adopted for emotion recognition, namely audio (voice), video (facial expression), electrocardiogram (ECG), and electrodermal activity (EDA), and investigate different combinations of them. To this aim, a CNN-based feature extraction module is adopted for each of the considered modalities, together with an RNN-based module for modelling the dynamics of the affective behaviour. Fusion is performed in three different ways: at feature level (after CNN feature extraction), at model level (combining the outputs of the RNN layers), and at prediction level (late fusion). Results obtained on the publicly available RECOLA dataset demonstrate that the use of multiple modalities improves prediction performance. The best results are achieved by exploiting the contribution of all the considered modalities and employing late fusion, but even combinations of two modalities (especially audio and video) bring significant benefits.
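To make the three fusion strategies concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the stand-in 1D CNN backbone, the GRU, all dimensions, the modality channel counts, and the simple averaging used for late fusion are illustrative assumptions.

import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """Per-modality pipeline: CNN feature extractor + RNN for temporal dynamics."""
    def __init__(self, in_channels: int, feat_dim: int = 64, hidden_dim: int = 32):
        super().__init__()
        # Stand-in 1D CNN feature extractor (the paper uses one CNN per modality;
        # this tiny conv block is only a placeholder).
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def extract(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, samples) -> one CNN feature vector per step.
        b, t, c, s = x.shape
        f = self.cnn(x.reshape(b * t, c, s)).squeeze(-1)  # (b*t, feat_dim)
        return f.reshape(b, t, -1)                        # (b, t, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(self.extract(x))                # (b, t, hidden_dim)
        return out

class FusionModel(nn.Module):
    """The three fusion points: 'feature', 'model', or 'late' (prediction level)."""
    def __init__(self, channels, strategy: str = "late"):
        super().__init__()
        self.strategy = strategy
        self.branches = nn.ModuleList(ModalityBranch(c) for c in channels)
        n = len(channels)
        if strategy == "feature":    # concatenate CNN features, one shared RNN
            self.rnn = nn.GRU(64 * n, 32, batch_first=True)
            self.head = nn.Linear(32, 2)                  # valence, arousal
        elif strategy == "model":    # concatenate the per-modality RNN outputs
            self.head = nn.Linear(32 * n, 2)
        else:                        # "late": average per-modality predictions
            self.heads = nn.ModuleList(nn.Linear(32, 2) for _ in channels)

    def forward(self, inputs):
        if self.strategy == "feature":
            feats = torch.cat([b.extract(x) for b, x in zip(self.branches, inputs)], -1)
            out, _ = self.rnn(feats)
            return self.head(out)
        outs = [b(x) for b, x in zip(self.branches, inputs)]
        if self.strategy == "model":
            return self.head(torch.cat(outs, -1))
        preds = [h(o) for h, o in zip(self.heads, outs)]
        return torch.stack(preds).mean(0)                 # simple average = late fusion

# Usage with assumed channel counts: audio (1), video (3), ECG (1), EDA (1).
model = FusionModel(channels=[1, 3, 1, 1], strategy="late")
x = [torch.randn(4, 10, c, 128) for c in [1, 3, 1, 1]]    # (batch, time, ch, samples)
print(model(x).shape)                                     # torch.Size([4, 10, 2])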
Acknowledgement
This work was part of project no. 2018-0858, titled "Stairway to elders: bridging space, time and emotions in their social environment for wellbeing", supported by Fondazione CARIPLO.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Patania, S., D’Amelio, A., Lanzarotti, R. (2022). Exploring Fusion Strategies in Deep Multimodal Affect Prediction. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_61
DOI: https://doi.org/10.1007/978-3-031-06430-2_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06429-6
Online ISBN: 978-3-031-06430-2
eBook Packages: Computer Science, Computer Science (R0)