Exploring Fusion Strategies in Deep Multimodal Affect Prediction

  • Conference paper
Image Analysis and Processing – ICIAP 2022 (ICIAP 2022)

Abstract

In this work, we explore the effectiveness of multimodal models for estimating the emotional state expressed continuously in the Valence/Arousal space. We consider four modalities typically adopted for emotion recognition, namely audio (voice), video (facial expression), electrocardiogram (ECG), and electrodermal activity (EDA), and investigate different combinations of them. To this end, a CNN-based feature extraction module is adopted for each of the considered modalities, together with an RNN-based module for modelling the dynamics of the affective behaviour. The fusion is performed in three different ways: at feature level (after the CNN feature extraction), at model level (combining the RNN layers' outputs), and at prediction level (late fusion). Results obtained on the publicly available RECOLA dataset demonstrate that the use of multiple modalities improves prediction performance. The best results are achieved by exploiting the contribution of all the considered modalities and employing late fusion, but even combinations of two modalities (especially audio and video) bring significant benefits.
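To make the three fusion strategies concrete, the sketch below outlines them in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the GRU recurrent layers, the 1D convolution standing in for the modality-specific CNN backbones, and all dimensions are hypothetical choices for the example.

import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    # One branch per modality: a CNN feature extractor (a 1D convolution
    # stands in for the real, modality-specific backbone) followed by a GRU
    # that models the temporal dynamics of the affective signal.
    def __init__(self, in_channels, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def features(self, x):                  # x: (batch, channels, time)
        return self.cnn(x).transpose(1, 2)  # -> (batch, time, feat_dim)

    def forward(self, x):
        out, _ = self.rnn(self.features(x))
        return out                          # -> (batch, time, hidden)

class FusionModel(nn.Module):
    # mode="feature": concatenate CNN features, then one shared RNN.
    # mode="model":   concatenate the per-modality RNN outputs.
    # mode="late":    average the per-modality valence/arousal predictions.
    def __init__(self, branches, mode="late", feat_dim=128, hidden=64):
        super().__init__()
        self.branches = nn.ModuleList(branches)
        self.mode = mode
        n = len(branches)
        if mode == "feature":
            self.rnn = nn.GRU(feat_dim * n, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)            # (valence, arousal)
        elif mode == "model":
            self.head = nn.Linear(hidden * n, 2)
        else:
            self.heads = nn.ModuleList([nn.Linear(hidden, 2) for _ in branches])

    def forward(self, inputs):              # one tensor per modality
        if self.mode == "feature":
            feats = torch.cat([b.features(x) for b, x in
                               zip(self.branches, inputs)], dim=-1)
            return self.head(self.rnn(feats)[0])
        if self.mode == "model":
            outs = torch.cat([b(x) for b, x in
                              zip(self.branches, inputs)], dim=-1)
            return self.head(outs)
        preds = [h(b(x)) for h, b, x in
                 zip(self.heads, self.branches, inputs)]
        return torch.stack(preds).mean(dim=0)

For example, a hypothetical audio/video model with late fusion would be built as FusionModel([ModalityBranch(1), ModalityBranch(3)], mode="late"); each input is a (batch, channels, time) sequence, and each output frame carries one valence and one arousal estimate.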

References

  1. Grossi, G., Lanzarotti, R., Napoletano, P., Noceti, N., Odone, F.: Positive technology for elderly well-being: a review. Pattern Recogn. Lett. 137, 61–70 (2020)

  2. Sun, A., Li, Y.-J., Huang, Y.-M., Li, Q.: Using facial expression to detect emotion in e-learning system: a deep learning method. In: Huang, T.-C., Lau, R., Huang, Y.-M., Spaniol, M., Yuen, C.-H. (eds.) SETE 2017. LNCS, vol. 10676, pp. 446–455. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71084-6_52

  3. Du, G., Zhou, W., Li, C., Li, D., Liu, P.X.: An emotion recognition method for game evaluation based on electroencephalogram. IEEE Trans. Affect. Comput. 1 (2020)

  4. Sebe, N., Cohen, I., Gevers, T., Huang, T.S.: Multimodal approaches for emotion recognition: a survey. In: Internet Imaging VI, vol. 5670, pp. 56–67. International Society for Optics and Photonics (2005)

  5. Nguyen, D., Nguyen, K., Sridharan, S., Dean, D., Fookes, C.: Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Comput. Vis. Image Underst. 174, 33–42 (2018)

  6. Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P., Zareapoor, M.: Hybrid deep neural networks for face emotion recognition. Pattern Recogn. Lett. 115, 101–106 (2018)

  7. Bursic, S., Boccignone, G., Ferrara, A., D’Amelio, A., Lanzarotti, R.: Improving the accuracy of automatic facial expression recognition in speaking subjects with deep learning. Appl. Sci. 10(11), 4002 (2020)

  8. Cuculo, V., D’Amelio, A.: OpenFACS: an open source FACS-based 3D face animation system. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds.) ICIG 2019. LNCS, vol. 11902, pp. 232–242. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34110-7_20

  9. Noroozi, F., Kaminska, D., Corneanu, C., Sapinski, T., Escalera, S., Anbarjafari, G.: Survey on emotional body gesture recognition. IEEE Trans. Affect. Comput. 12, 505–523 (2018)

  10. Albanie, S., Nagrani, A., Vedaldi, A., Zisserman, A.: Emotion recognition in speech using cross-modal transfer in the wild. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 292–301 (2018)

  11. Song, T., Zheng, W., Song, P., Cui, Z.: EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans. Affect. Comput. 11(3), 532–541 (2018)

  12. Sarkar, P., Etemad, A.: Self-supervised ECG representation learning for emotion recognition. IEEE Trans. Affect. Comput. 1 (2020)

  13. Shukla, J., Barreda-Angeles, M., Oliver, J., Nandi, G.C., Puig, D.: Feature extraction and selection for emotion recognition from electrodermal activity. IEEE Trans. Affect. Comput. 12(4), 857–869 (2019)

  14. Boccignone, G., Conte, D., Cuculo, V., D’Amelio, A., Grossi, G., Lanzarotti, R.: Deep construction of an affective latent space via multimodal enactment. IEEE Trans. Cogn. Dev. Syst. 10(4), 865–880 (2018)

  15. Schuller, B., Valstar, M., Cowie, R., Pantic, M.: The first audio/visual emotion challenge and workshop – an introduction. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6975, p. 322. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24571-8_42

  16. Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Topics Signal Process. 11(8), 1301–1309 (2017)

  17. Soleymani, M., Asghari-Esfeden, S., Fu, Y., Pantic, M.: Analysis of EEG signals and facial expressions for continuous emotion detection. IEEE Trans. Affect. Comput. 7(1), 17–28 (2015)

  18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  19. Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q.: Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circ. Syst. Video Technol. 28(10), 3030–3043 (2017)

  20. Du, G., Long, S., Yuan, H.: Non-contact emotion recognition combining heart rate and facial expression for interactive gaming environments. IEEE Access 8, 11896–11906 (2020)

  21. Ho, N.-H., Yang, H.-J., Kim, S.-H., Lee, G.: Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 8, 61672–61686 (2020)

  22. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  23. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Proceedings of the British Machine Vision Conference (BMVC) (2015)

  24. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)

  25. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013)

  26. Lin, L.I.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989)

Acknowledgement

This work was part of project no. 2018-0858, titled “Stairway to elders: bridging space, time and emotions in their social environment for wellbeing”, supported by Fondazione CARIPLO.

Author information

Corresponding author

Correspondence to Sabrina Patania.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Patania, S., D’Amelio, A., Lanzarotti, R. (2022). Exploring Fusion Strategies in Deep Multimodal Affect Prediction. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_61

  • DOI: https://doi.org/10.1007/978-3-031-06430-2_61

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06429-6

  • Online ISBN: 978-3-031-06430-2

  • eBook Packages: Computer Science, Computer Science (R0)
