Research article
DOI: 10.1145/3581783.3613846

Unlocking the Power of Multimodal Learning for Emotion Recognition in Conversation

Published: 27 October 2023

Abstract

Emotion recognition in conversation aims to identify the emotions underlying each utterance, and it has great potential in various domains. Human perception of emotions relies on multiple modalities, such as language, vocal tonality, and facial expressions. While many studies have incorporated multimodal information to enhance emotion recognition, the performance of multimodal models often plateaus when additional modalities are added. We demonstrate through experiments that the main reason for this plateau is an imbalanced assignment of gradients across modalities. To address this issue, we propose fine-grained adaptive gradient modulation, a plug-in approach to rebalance the gradients of modalities. Experimental results show that our method improves the performance of all baseline models and outperforms existing plug-in methods.
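The imbalance described in the abstract can be made concrete by comparing how much gradient each modality branch of a multimodal model actually receives during training. Below is a minimal, illustrative sketch in PyTorch-style Python of such a diagnostic; it is not the authors' released code, and the branch attribute names (text_encoder, audio_encoder, visual_encoder) are hypothetical placeholders.

    from torch import nn

    def modality_grad_norms(model: nn.Module) -> dict:
        """Return the total L2 gradient norm accumulated in each modality branch.

        Assumes the model exposes one submodule per modality; the attribute
        names below are hypothetical and must be adapted to the actual model.
        """
        branches = {
            "text": model.text_encoder,
            "audio": model.audio_encoder,
            "visual": model.visual_encoder,
        }
        norms = {}
        for name, module in branches.items():
            sq_sum = 0.0
            for p in module.parameters():
                if p.grad is not None:
                    sq_sum += p.grad.pow(2).sum().item()  # accumulate squared grad entries
            norms[name] = sq_sum ** 0.5
        return norms

    # Called after loss.backward(): a persistently skewed ratio between the
    # returned norms is the kind of gradient imbalance the abstract refers to.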

Supplemental Material

MP4 File
In this video, we introduce the Diminishing Modal Marginal Utility phenomenon on the ERC task and point out that its cause is an imbalanced gradient assignment across modalities. We then present the proposed plug-in method, Fine-grained Adaptive Gradient Modulation, which rebalances the gradients of modalities at the parameter level. We have open-sourced our code and evaluated the method through experiments on two benchmark datasets.
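To illustrate what "rebalancing gradients at the parameter level" can look like in general, here is a minimal sketch of modality-wise gradient rescaling applied between the backward pass and the optimizer step. It is an assumption-laden illustration of the generic technique, not the paper's exact FAGM algorithm; the names modality_params and modulation_coeff are hypothetical.

    import torch

    def rebalance_modality_gradients(modality_params: dict, modulation_coeff: dict) -> None:
        """Scale each modality's parameter gradients by a per-modality coefficient.

        modality_params: maps a modality name (e.g. 'text', 'audio', 'visual')
            to the list of parameters in that modality's branch.
        modulation_coeff: maps the same names to a scalar factor, e.g. larger
            for modalities whose gradients are being under-assigned.
        """
        with torch.no_grad():
            for modality, params in modality_params.items():
                coeff = modulation_coeff[modality]
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(coeff)  # rescale the gradient in place

    # Typical placement in a training loop:
    #   loss.backward()
    #   rebalance_modality_gradients(modality_params, modulation_coeff)
    #   optimizer.step()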


Cited By

  • (2024) Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, 9330-9339. DOI: 10.1145/3664647.3681648. Online publication date: 28-Oct-2024.
  • (2024) Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation. In Proceedings of the 32nd ACM International Conference on Multimedia, 4341-4348. DOI: 10.1145/3664647.3681633. Online publication date: 28-Oct-2024.
  • (2024) Graph Convolutional Metric Learning for Recommender Systems in Smart Cities. IEEE Transactions on Consumer Electronics, 70(3), 5929-5941. DOI: 10.1109/TCE.2024.3411704. Online publication date: Aug-2024.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. emotion recognition in conversation
  2. fine-grained adaptive gradient modulation
  3. multimodal balanced learning

Qualifiers

  • Research-article

Funding Sources

  • Key R&D Program of Shandong (Major scientific and technological innovation projects)
  • Special Fund for Distinguished Professors of Shandong Jianzhu University
  • National Natural Science Foundation of China
  • Shenzhen College Stability Support Plan
  • National Natural Science Foundation (NSF) of China
  • NSF of Shandong Province

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 293
  • Downloads (last 6 weeks): 17
Reflects downloads up to 05 Mar 2025.
