Research article
DOI: 10.1145/3581783.3613846

Unlocking the Power of Multimodal Learning for Emotion Recognition in Conversation

Published: 27 October 2023

Abstract

Emotion recognition in conversation aims to identify the emotions underlying each utterance, and it has great potential in various domains. Human perception of emotions relies on multiple modalities, such as language, vocal tonality, and facial expressions. While many studies have incorporated multimodal information to enhance emotion recognition, the performance of multimodal models often plateaus when additional modalities are added. We demonstrate through experiments that the main reason for this plateau is an imbalanced assignment of gradients across modalities. To address this issue, we propose fine-grained adaptive gradient modulation, a plug-in approach to rebalance the gradients of modalities. Experimental results show that our method improves the performance of all baseline models and outperforms existing plug-in methods.
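The imbalance described in the abstract can be made concrete by comparing how much gradient each modality branch of a multimodal model actually receives during training. Below is a minimal, illustrative sketch in PyTorch-style Python of such a diagnostic; it is not the authors' released code, and the branch attribute names (text_encoder, audio_encoder, visual_encoder) are hypothetical placeholders.

    from torch import nn

    def modality_grad_norms(model: nn.Module) -> dict:
        """Return the total L2 gradient norm accumulated in each modality branch.

        Assumes the model exposes one submodule per modality; the attribute
        names below are hypothetical and must be adapted to the actual model.
        """
        branches = {
            "text": model.text_encoder,
            "audio": model.audio_encoder,
            "visual": model.visual_encoder,
        }
        norms = {}
        for name, module in branches.items():
            sq_sum = 0.0
            for p in module.parameters():
                if p.grad is not None:
                    sq_sum += p.grad.pow(2).sum().item()  # accumulate squared grad entries
            norms[name] = sq_sum ** 0.5
        return norms

    # Called after loss.backward(): a persistently skewed ratio between the
    # returned norms is the kind of gradient imbalance the abstract refers to.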

Supplemental Material

MP4 File
In this video, we introduce the Diminishing Modal Marginal Utility phenomenon on the ERC task and point out that its cause is an imbalanced gradient assignment across modalities. We then present the proposed plug-in method, Fine-grained Adaptive Gradient Modulation, which rebalances the gradients of modalities at the parameter level. We have open-sourced our code and evaluated the method through experiments on two benchmark datasets.
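To illustrate what "rebalancing gradients at the parameter level" can look like in general, here is a minimal sketch of modality-wise gradient rescaling applied between the backward pass and the optimizer step. It is an assumption-laden illustration of the generic technique, not the paper's exact FAGM algorithm; the names modality_params and modulation_coeff are hypothetical.

    import torch

    def rebalance_modality_gradients(modality_params: dict, modulation_coeff: dict) -> None:
        """Scale each modality's parameter gradients by a per-modality coefficient.

        modality_params: maps a modality name (e.g. 'text', 'audio', 'visual')
            to the list of parameters in that modality's branch.
        modulation_coeff: maps the same names to a scalar factor, e.g. larger
            for modalities whose gradients are being under-assigned.
        """
        with torch.no_grad():
            for modality, params in modality_params.items():
                coeff = modulation_coeff[modality]
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(coeff)  # rescale the gradient in place

    # Typical placement in a training loop:
    #   loss.backward()
    #   rebalance_modality_gradients(modality_params, modulation_coeff)
    #   optimizer.step()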


Cited By

  • (2024) Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, 9330-9339. DOI: 10.1145/3664647.3681648. Online publication date: 28-Oct-2024.
  • (2024) Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation. In Proceedings of the 32nd ACM International Conference on Multimedia, 4341-4348. DOI: 10.1145/3664647.3681633. Online publication date: 28-Oct-2024.
  • (2024) Graph Convolutional Metric Learning for Recommender Systems in Smart Cities. IEEE Transactions on Consumer Electronics, 70(3), 5929-5941. DOI: 10.1109/TCE.2024.3411704. Online publication date: Aug-2024.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. emotion recognition in conversation
  2. fine-grained adaptive gradient modulation
  3. multimodal balanced learning

Qualifiers

  • Research-article

Funding Sources

  • Key R&D Program of Shandong (Major scientific and technological innovation projects)
  • Special Fund for Distinguished Professors of Shandong Jianzhu University
  • National Natural Science Foundation of China
  • Shenzhen College Stability Support Plan
  • National Natural Science Foundation (NSF) of China
  • NSF of Shandong Province

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 293
  • Downloads (last 6 weeks): 17
Reflects downloads up to 05 Mar 2025.
