Abstract
Audio-Visual Emotion Recognition (AVER) is essential in many real-world applications. Most methods extract and fuse the audio and visual modalities to better comprehend and classify the underlying emotion. Recently, large pre-trained models have brought powerful modality-fusion ability on general datasets and significantly outperform traditional small-scale models. However, they are less effective in some specialized scenarios, where the meanings conveyed by the two modalities can conflict. This paper proposes a parameter-efficient fine-tuning method, the Content-Aware Efficient Learner (CAEL), to address this problem at minimal computational cost. Specifically, we propose an adapter network built on pre-trained audio and visual transformers for modality fusion. To fuse the two modalities more effectively, we propose content-aware attention, in which the audio and visual information are aligned and fused under the guidance of the speech content. Extensive experiments on the CREMA-D dataset verify the effectiveness and efficiency of the proposed framework.
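The abstract's two ideas — content-guided cross-modal attention and a bottleneck adapter — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all function names, shapes, and the additive fusion rule are illustrative assumptions; here speech-content embeddings act as queries that attend separately to audio and visual features, and the fused result passes through a residual bottleneck adapter.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_guided_attention(content, audio, visual, d):
    # content: (Tc, d) speech-content embeddings used as queries
    # audio: (Ta, d) and visual: (Tv, d) features used as keys/values
    att_a = softmax(content @ audio.T / np.sqrt(d)) @ audio    # (Tc, d)
    att_v = softmax(content @ visual.T / np.sqrt(d)) @ visual  # (Tc, d)
    return att_a + att_v  # fused under the guidance of speech content

def adapter(x, w_down, w_up):
    # bottleneck adapter: down-project, ReLU, up-project, residual add;
    # only w_down/w_up would be trained, keeping fine-tuning parameter-efficient
    return x + np.maximum(x @ w_down, 0.0) @ w_up

rng = np.random.default_rng(0)
d, r = 8, 2  # feature dim, bottleneck dim (toy sizes)
content = rng.standard_normal((4, d))
audio = rng.standard_normal((6, d))
visual = rng.standard_normal((5, d))
fused = content_guided_attention(content, audio, visual, d)
out = adapter(fused, 0.1 * rng.standard_normal((d, r)),
              0.1 * rng.standard_normal((r, d)))
print(out.shape)  # (4, 8)
```

The bottleneck (d → r → d) is what keeps the adapter cheap: only the small down/up projections are updated while the pre-trained transformers stay frozen.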
G. Huang and W. Lin: These authors contributed equally.
Notes
- 1.
- 2. Split the training and test sets according to different actors.
References
Baevski, A., Zhou, Y., et al.: wav2vec 2.0: a framework for self-supervised learning of speech representations. In: NeurIPS (2020)
Cao, H., Cooper, D.G., et al.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014)
Cao, Q., Shen, L., et al.: VGGFace2: a dataset for recognising faces across pose and age. In: FG 2018 (2018)
Chen, S., Wang, C., et al.: WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 16(6), 1505–1518 (2022)
Chen, S., Jin, Q., et al.: Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge (2017)
Chen, Y., et al.: USCL: pretraining deep ultrasound image diagnosis model through video contrastive representation learning. In: de Bruijne, M., et al. (eds.) MICCAI 2021, Part VIII. LNCS, vol. 12908, pp. 627–637. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87237-3_60
Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: deep speaker recognition. arXiv (2018)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
Dosovitskiy, A., Beyer, L., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2020)
Eyben, F., Scherer, K.R., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2015)
Fan, R., Liu, H., et al.: AttA-NET: attention aggregation network for audio-visual emotion recognition. In: ICASSP (2024)
Hershey, S., Chaudhuri, S., et al.: CNN architectures for large-scale audio classification. In: ICASSP (2017)
Houlsby, N., Giurgiu, A., et al.: Parameter-efficient transfer learning for NLP. In: ICML (2019)
Hsu, W.N., Bolte, B., et al.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. TASLP 29, 3451–3460 (2021)
Hu, T., Xu, A., et al.: Touch your heart: a tone-aware chatbot for customer care on social media. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (2018)
Jie, S., Deng, Z.H.: Convolutional bypasses are better vision transformer adapters. arXiv (2022)
Kong, Q., Cao, Y., et al.: PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2880–2894 (2020)
Liu, L., Feng, G., Beautemps, D., Zhang, X.P.: Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition. IEEE Trans. Multimed. 23, 292–305 (2020)
Liu, L., Hueber, T., Feng, G., Beautemps, D.: Visual recognition of continuous cued speech using a tandem CNN-HMM approach. In: Interspeech, pp. 2643–2647 (2018)
Meng, L., Liu, Y., et al.: Valence and arousal estimation based on multimodal temporal-aware features for videos in the wild. In: CVPR (2022)
Polignano, M., Narducci, F., et al.: Towards emotion-aware recommender systems: an affective coherence model based on emotion-driven behaviors. Expert Syst. Appl. 170, 114382 (2021)
Radford, A., Kim, J.W., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Sadok, S., Leglaive, S., Séguier, R.: A vector quantized masked autoencoder for audiovisual speech emotion recognition. arXiv (2023)
Schuller, B., Steidl, S., et al.: The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: INTERSPEECH (2013)
Sun, L., Lian, Z., et al.: Multi-modal continuous dimensional emotion recognition using recurrent neural network and self-attention mechanism. In: Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-Life Media Challenge and Workshop (2020)
Sun, L., Lian, Z., et al.: Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis (2023)
Sun, L., Lian, Z., et al.: MAE-DFER: efficient masked autoencoder for self-supervised dynamic facial expression recognition. In: ACM Multimedia (2023)
Sun, L., Lian, Z., et al.: SVFAP: self-supervised video facial affect perceiver. arXiv (2023)
Sun, L., Lian, Z., et al.: HiCMAE: hierarchical contrastive masked autoencoder for self-supervised audio-visual emotion recognition. arXiv (2024)
Sun, L., Xu, M., et al.: Multimodal emotion recognition and sentiment analysis via attention enhanced recurrent model. In: MSAC (2021)
Touvron, H., Martin, L., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv (2023)
Tran, D., Bourdev, L., et al.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)
Tsai, Y.H.H., Bai, S., et al.: Multimodal transformer for unaligned multimodal language sequences. In: ACL (2019)
Vaswani, A., Shazeer, N., et al.: Attention is all you need. In: NeurIPS (2017)
Verbitskiy, S., Berikov, V., Vyshegorodtsev, V.: ERANNs: efficient residual audio neural networks for audio pattern recognition. Pattern Recogn. Lett. 161, 38–44 (2022)
Wang, J., Zhao, Y., Liu, L., Xu, T., Li, Q., Li, S.: Emotional talking head generation based on memory-sharing and attention-augmented networks (2023)
Wu, C.H., Lin, J.C., Wei, W.L.: Survey on audiovisual emotion recognition: databases, features, and data fusion strategies. APSIPA Trans. Signal Inf. Process. 3, e12 (2014)
Zeng, Z., Pantic, M., et al.: A survey of affect recognition methods: audio, visual and spontaneous expressions. In: ICMI (2007)
Zhang, S., Yang, Y., et al.: Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: a systematic review of recent advancements and future prospects. Expert Syst. Appl. 121692 (2023)
Zhang, X., Li, M., et al.: Transformer-based multimodal emotional perception for dynamic facial expression recognition in the wild. IEEE Trans. Circuits Syst. Video Technol. (2023)
Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007)
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62101351), Guangzhou Municipal Science and Technology Project: Basic and Applied Basic research projects (No. 2024A04J4232).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huang, G., Lin, W., Liu, L. (2025). Content-Aware Efficient Learner for Audio-Visual Emotion Recognition. In: Li, H., et al. Social Robotics. ICSR + InnoBiz 2024. Lecture Notes in Computer Science, vol. 15170. Springer, Singapore. https://doi.org/10.1007/978-981-96-1151-5_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-1150-8
Online ISBN: 978-981-96-1151-5
eBook Packages: Computer Science, Computer Science (R0)