Content-Aware Efficient Learner for Audio-Visual Emotion Recognition

  • Conference paper
  • First Online:
Social Robotics (ICSR + InnoBiz 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15170)


Abstract

Audio-Visual Emotion Recognition (AVER) is essential in various real-world applications. Many methods extract and fuse the audio and visual modalities to better comprehend and classify the underlying emotion. Recently, large pre-trained models have brought powerful modality-fusion ability on general datasets and significantly outperformed traditional small-scale models. However, they are less effective in specialized scenarios where the meanings of the two modalities conflict. This paper proposes a parameter-efficient fine-tuning method, the Content-Aware Efficient Learner (CAEL), to solve this problem at minimal computational cost. Specifically, we propose an adapter network built on pre-trained audio and visual transformers for modality fusion. To better fuse the two modalities, we propose content-aware attention, in which the audio and visual information are aligned and fused under the guidance of the speech content. Extensive experiments on the CREMA-D dataset verify the effectiveness and efficiency of our proposed framework.
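The paper itself does not include code; as a rough illustration of the content-aware attention idea only (the function name, shapes, and single-head formulation below are our own assumptions, not the authors' implementation), speech-content embeddings can serve as queries that attend over pooled audio and visual tokens, so the fusion is aligned with what is being said:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_aware_fusion(content, audio, visual):
    """Toy single-head cross-attention: speech-content tokens act as
    queries; concatenated audio + visual tokens act as keys/values."""
    d = content.shape[-1]
    kv = np.concatenate([audio, visual], axis=0)   # (Ta + Tv, d) joint pool
    scores = content @ kv.T / np.sqrt(d)           # (Tc, Ta + Tv)
    weights = softmax(scores, axis=-1)             # each query row sums to 1
    return weights @ kv                            # (Tc, d) content-aligned fusion

# Example with random features: 4 content, 6 audio, 5 visual tokens, dim 8
rng = np.random.default_rng(0)
fused = content_aware_fusion(rng.normal(size=(4, 8)),
                             rng.normal(size=(6, 8)),
                             rng.normal(size=(5, 8)))
print(fused.shape)  # (4, 8)
```

In an adapter-style setup such as CAEL describes, a block like this would sit beside the frozen pre-trained transformers, so only the small fusion module is trained.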

G. Huang and W. Lin: These authors contributed equally.


Notes

  1. https://github.com/TadasBaltrusaitis/OpenFace.git

  2. The training and test sets are split according to different actors.
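The actor-based split in the note can be sketched as follows (a minimal illustration under our own assumptions; the paper does not specify which actors are held out). Holding out whole actors ensures no speaker identity leaks between training and evaluation:

```python
def actor_disjoint_split(clips, test_actors):
    """Hold out entire actors so no speaker appears in both splits.
    clips: iterable of (actor_id, clip_path) pairs."""
    test_actors = set(test_actors)
    train = [c for c in clips if c[0] not in test_actors]
    test = [c for c in clips if c[0] in test_actors]
    return train, test

# Hypothetical CREMA-D-style actor IDs and file names
clips = [("1001", "a.mp4"), ("1001", "b.mp4"),
         ("1002", "c.mp4"), ("1003", "d.mp4")]
train, test = actor_disjoint_split(clips, {"1003"})
print(len(train), len(test))  # 3 1
```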


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62101351), Guangzhou Municipal Science and Technology Project: Basic and Applied Basic research projects (No. 2024A04J4232).

Author information

Corresponding author: Li Liu.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huang, G., Lin, W., Liu, L. (2025). Content-Aware Efficient Learner for Audio-Visual Emotion Recognition. In: Li, H., et al. (eds.) Social Robotics. ICSR + InnoBiz 2024. Lecture Notes in Computer Science (LNAI), vol. 15170. Springer, Singapore. https://doi.org/10.1007/978-981-96-1151-5_4

Download citation

  • DOI: https://doi.org/10.1007/978-981-96-1151-5_4

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-1150-8

  • Online ISBN: 978-981-96-1151-5

  • eBook Packages: Computer Science, Computer Science (R0)
