DOI: 10.1145/3678957.3689332
Research Article | Open Access

Multimodal Emotion Recognition Harnessing the Complementarity of Speech, Language, and Vision

Published: 04 November 2024

Abstract

In audiovisual emotion recognition, a significant challenge lies in designing neural network architectures that effectively harness and integrate multimodal information. This study introduces a methodology for the Empathic Virtual Agent Challenge (EVAC) built on state-of-the-art speech, language, and vision models. Specifically, we leverage cutting-edge pre-trained models, including multilingual variants fine-tuned on French data for each modality, and integrate them using late fusion. Through extensive experimentation and validation, we demonstrate that our approach achieves competitive results on the challenge dataset. Our findings show that multimodal approaches outperform unimodal methods on both the Core Affect Presence and Intensity and the Appraisal Dimensions tasks, underscoring the importance of combining multiple sources of information to capture nuanced emotional states more accurately and robustly in real-world applications.
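To make the fusion step concrete, the sketch below shows one common way to implement decision-level (late) fusion over pre-trained encoders: each pooled modality embedding is classified by its own small head, and the per-modality logits are averaged. This is an illustrative PyTorch sketch, not the authors' exact architecture; the embedding sizes, head layout, and averaging rule are assumptions made for the example.

```python
# Minimal late-fusion sketch (hypothetical; embedding sizes and head design are assumed).
import torch
import torch.nn as nn


class UnimodalHead(nn.Module):
    """Small classifier applied on top of one pooled modality embedding."""

    def __init__(self, embed_dim: int, num_labels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class LateFusion(nn.Module):
    """Decision-level fusion: average the logits produced by each modality head."""

    def __init__(self, dims: dict, num_labels: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: UnimodalHead(d, num_labels) for name, d in dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        logits = [self.heads[name](x) for name, x in feats.items()]
        return torch.stack(logits, dim=0).mean(dim=0)


# Example with made-up embedding sizes for speech, text, and vision encoders.
model = LateFusion({"speech": 1024, "text": 768, "vision": 512}, num_labels=8)
batch = {
    "speech": torch.randn(4, 1024),  # e.g. pooled features from a speech encoder
    "text": torch.randn(4, 768),     # e.g. pooled features from a French language model
    "vision": torch.randn(4, 512),   # e.g. pooled features from a face/video encoder
}
print(model(batch).shape)  # torch.Size([4, 8])
```

Averaging logits is only one late-fusion rule; weighted averaging or training a small classifier on the concatenated per-modality predictions are common alternatives.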

Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 November 2024

Author Tags

  1. Foundational models
  2. Fusion
  3. Multimodal Emotion Recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4 - 8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
