DOI: 10.1145/3678957.3689332
Research Article | Open Access

Multimodal Emotion Recognition Harnessing the Complementarity of Speech, Language, and Vision

Published: 04 November 2024

Abstract

In audiovisual emotion recognition, a significant challenge lies in designing neural network architectures that effectively harness and integrate multimodal information. This study introduces a methodology for the Empathic Virtual Agent Challenge (EVAC) built on state-of-the-art speech, language, and vision models. Specifically, we leverage cutting-edge pre-trained models, including multilingual variants fine-tuned on French data for each modality, and integrate them using late fusion. Through extensive experimentation and validation, we demonstrate that our approach achieves competitive results on the challenge dataset. Our findings show that multimodal approaches outperform unimodal methods on both the Core Affect Presence and Intensity and the Appraisal Dimensions tasks, underscoring the importance of combining multiple sources of information to capture nuanced emotional states more accurately and robustly in real-world applications.
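To make the fusion step concrete, the sketch below shows one common way to implement decision-level (late) fusion over pre-trained encoders: each pooled modality embedding is classified by its own small head, and the per-modality logits are averaged. This is an illustrative PyTorch sketch, not the authors' exact architecture; the embedding sizes, head layout, and averaging rule are assumptions made for the example.

```python
# Minimal late-fusion sketch (hypothetical; embedding sizes and head design are assumed).
import torch
import torch.nn as nn


class UnimodalHead(nn.Module):
    """Small classifier applied on top of one pooled modality embedding."""

    def __init__(self, embed_dim: int, num_labels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class LateFusion(nn.Module):
    """Decision-level fusion: average the logits produced by each modality head."""

    def __init__(self, dims: dict, num_labels: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: UnimodalHead(d, num_labels) for name, d in dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        logits = [self.heads[name](x) for name, x in feats.items()]
        return torch.stack(logits, dim=0).mean(dim=0)


# Example with made-up embedding sizes for speech, text, and vision encoders.
model = LateFusion({"speech": 1024, "text": 768, "vision": 512}, num_labels=8)
batch = {
    "speech": torch.randn(4, 1024),  # e.g. pooled features from a speech encoder
    "text": torch.randn(4, 768),     # e.g. pooled features from a French language model
    "vision": torch.randn(4, 512),   # e.g. pooled features from a face/video encoder
}
print(model(batch).shape)  # torch.Size([4, 8])
```

Averaging logits is only one late-fusion rule; weighted averaging or training a small classifier on the concatenated per-modality predictions are common alternatives.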

Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 November 2024

Author Tags

  1. Foundational models
  2. Fusion
  3. Multimodal Emotion Recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4 - 8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
