Content-Aware Efficient Learner for Audio-Visual Emotion Recognition

  • Conference paper
  • First Online:
Social Robotics (ICSR + InnoBiz 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15170)


Abstract

Audio-Visual Emotion Recognition (AVER) is essential in various real-world applications. Many methods extract and fuse the audio and visual modalities to better comprehend and classify the underlying emotion. Recently, large pre-trained models have brought powerful modality-fusion ability on general datasets and significantly outperformed traditional small-scale models. However, they are less effective in specialized scenarios where the meanings of the two modalities conflict. This paper proposes a parameter-efficient fine-tuning method, the Content-Aware Efficient Learner (CAEL), to solve this problem at minimal computational cost. Specifically, we propose an adapter network built on pre-trained audio and visual transformers for modality fusion. To better fuse the two modalities, we propose content-aware attention, in which the audio and visual information are aligned and fused under the guidance of the speech content. Extensive experiments on the CREMA-D dataset verify the effectiveness and efficiency of our proposed framework.
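The paper itself does not include code; as a rough illustration of the content-aware attention idea only (the function name, shapes, and single-head formulation below are our own assumptions, not the authors' implementation), speech-content embeddings can serve as queries that attend over pooled audio and visual tokens, so the fusion is aligned with what is being said:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_aware_fusion(content, audio, visual):
    """Toy single-head cross-attention: speech-content tokens act as
    queries; concatenated audio + visual tokens act as keys/values."""
    d = content.shape[-1]
    kv = np.concatenate([audio, visual], axis=0)   # (Ta + Tv, d) joint pool
    scores = content @ kv.T / np.sqrt(d)           # (Tc, Ta + Tv)
    weights = softmax(scores, axis=-1)             # each query row sums to 1
    return weights @ kv                            # (Tc, d) content-aligned fusion

# Example with random features: 4 content, 6 audio, 5 visual tokens, dim 8
rng = np.random.default_rng(0)
fused = content_aware_fusion(rng.normal(size=(4, 8)),
                             rng.normal(size=(6, 8)),
                             rng.normal(size=(5, 8)))
print(fused.shape)  # (4, 8)
```

In an adapter-style setup such as CAEL describes, a block like this would sit beside the frozen pre-trained transformers, so only the small fusion module is trained.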

G. Huang and W. Lin: These authors contributed equally.


Notes

  1. https://github.com/TadasBaltrusaitis/OpenFace.git

  2. The training and test sets are split according to different actors.
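The actor-based split in the note can be sketched as follows (a minimal illustration under our own assumptions; the paper does not specify which actors are held out). Holding out whole actors ensures no speaker identity leaks between training and evaluation:

```python
def actor_disjoint_split(clips, test_actors):
    """Hold out entire actors so no speaker appears in both splits.
    clips: iterable of (actor_id, clip_path) pairs."""
    test_actors = set(test_actors)
    train = [c for c in clips if c[0] not in test_actors]
    test = [c for c in clips if c[0] in test_actors]
    return train, test

# Hypothetical CREMA-D-style actor IDs and file names
clips = [("1001", "a.mp4"), ("1001", "b.mp4"),
         ("1002", "c.mp4"), ("1003", "d.mp4")]
train, test = actor_disjoint_split(clips, {"1003"})
print(len(train), len(test))  # 3 1
```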


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62101351), Guangzhou Municipal Science and Technology Project: Basic and Applied Basic research projects (No. 2024A04J4232).

Author information

Corresponding author: Li Liu.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huang, G., Lin, W., Liu, L. (2025). Content-Aware Efficient Learner for Audio-Visual Emotion Recognition. In: Li, H., et al. (eds.) Social Robotics. ICSR + InnoBiz 2024. Lecture Notes in Computer Science (LNAI), vol. 15170. Springer, Singapore. https://doi.org/10.1007/978-981-96-1151-5_4

Download citation

  • DOI: https://doi.org/10.1007/978-981-96-1151-5_4

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-1150-8

  • Online ISBN: 978-981-96-1151-5

  • eBook Packages: Computer Science, Computer Science (R0)
