DOI: 10.1145/3581783.3612295

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

Published: 27 October 2023

Abstract

Effective alignment and fusion of multimodal features remain a significant challenge for multimodal sentiment analysis. In many multimodal applications, the text modality offers a notable advantage: a compact yet expressive representation. In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide the other modalities in learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. We propose a semantic representation interactive learning module that learns concise semantic representation tokens for the audio and video modalities under the guidance of the text modality, ensuring semantic alignment of the representations across modalities. Furthermore, we design a semantic relationship interactive learning module, which computes a self-attention matrix for each modality and enforces consistency among them to align the semantic relationships across modalities. Finally, we present a two-stage interactive fusion solution to bridge the modality gap for multimodal fusion and sentiment analysis. Extensive experiments on the CMU-MOSEI, CMU-MOSI, and UR-FUNNY datasets demonstrate the effectiveness of the proposed approach.
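The full method is not reproduced on this page, so the following is only a minimal sketch of the semantic relationship alignment idea described in the abstract: compute a plain self-attention matrix for the token sequence of each modality and penalize disagreement between the audio/video matrices and the text one. The function names, the shared query/key projection, the choice of the text matrix as a detached anchor, and the MSE consistency loss are illustrative assumptions rather than the exact CRIL formulation; the sketch also assumes the audio and video sequences have already been reduced to the same number of tokens as the text sequence.

    import torch
    import torch.nn.functional as F

    def self_attention_matrix(tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) token embeddings of one modality.
        # Returns the (batch, num_tokens, num_tokens) self-attention matrix;
        # queries and keys reuse the raw embeddings here purely for illustration.
        d_k = tokens.size(-1)
        scores = torch.matmul(tokens, tokens.transpose(-2, -1)) / (d_k ** 0.5)
        return F.softmax(scores, dim=-1)

    def relationship_consistency_loss(text, audio, video):
        # Hypothetical consistency term: pull the audio and video self-attention
        # matrices toward the text one, so token-to-token relationships agree.
        # Assumes all three sequences share the same token count, as when the
        # audio/video features have been distilled into text-guided tokens.
        a_text = self_attention_matrix(text).detach()  # text acts as the anchor
        a_audio = self_attention_matrix(audio)
        a_video = self_attention_matrix(video)
        return F.mse_loss(a_audio, a_text) + F.mse_loss(a_video, a_text)

    # Example usage with random features (batch of 2, 8 tokens, 64 dimensions):
    text = torch.randn(2, 8, 64)
    audio = torch.randn(2, 8, 64)
    video = torch.randn(2, 8, 64)
    loss = relationship_consistency_loss(text, audio, video)

Minimizing such a term during training would encourage the non-text modalities to reproduce the pairwise token relationships observed in the text modality, which is one plausible reading of "controlling consistency" of the per-modality attention matrices.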

Supplemental Material

MP4 File: Presentation video


Cited By

  • (2024) Web Semantic-Enhanced Multimodal Sentiment Analysis Using Multilayer Cross-Attention Fusion. International Journal on Semantic Web & Information Systems 20(1), 1-29. DOI: 10.4018/IJSWIS.360653. Online publication date: 13-Dec-2024.
  • (2024) ERL-MR: Harnessing the Power of Euler Feature Representations for Balanced Multi-modal Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4591-4600. DOI: 10.1145/3664647.3681215. Online publication date: 28-Oct-2024.
  • (2024) Progressive Multimodal Pivot Learning: Towards Semantic Discordance Understanding as Humans. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 591-601. DOI: 10.1145/3627673.3679524. Online publication date: 21-Oct-2024.
  • (2024) Uncertainty-Debiased Multimodal Fusion: Learning Deterministic Joint Representation for Multimodal Sentiment Analysis. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688376. Online publication date: 15-Jul-2024.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. multimodal fusion
    2. multimodal sentiment analysis
    3. representation interactive learning

    Qualifiers

    • Research-article

    Funding Sources

    • the Science and Technology Innovation Committee of Shenzhen Municipality Foundation

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


