DOI: 10.1145/3581783.3612295

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

Published: 27 October 2023

Abstract

Effective alignment and fusion of multimodal features remain a significant challenge for multimodal sentiment analysis. In many multimodal applications, the text modality offers a notable advantage: a compact yet expressive representation. In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide the other modalities in learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. We propose a semantic representation interactive learning module that learns concise semantic representation tokens for the audio and video modalities under the guidance of the text modality, ensuring semantic alignment of the representations across modalities. Furthermore, we design a semantic relationship interactive learning module, which computes a self-attention matrix for each modality and enforces consistency among them to align the semantic relationships across modalities. Finally, we present a two-stage interactive fusion solution to bridge the modality gap for multimodal fusion and sentiment analysis. Extensive experiments on the CMU-MOSEI, CMU-MOSI, and UR-FUNNY datasets demonstrate the effectiveness of the proposed approach.
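The full method is not reproduced on this page, so the following is only a minimal sketch of the semantic relationship alignment idea described in the abstract: compute a plain self-attention matrix for the token sequence of each modality and penalize disagreement between the audio/video matrices and the text one. The function names, the shared query/key projection, the choice of the text matrix as a detached anchor, and the MSE consistency loss are illustrative assumptions rather than the exact CRIL formulation; the sketch also assumes the audio and video sequences have already been reduced to the same number of tokens as the text sequence.

    import torch
    import torch.nn.functional as F

    def self_attention_matrix(tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) token embeddings of one modality.
        # Returns the (batch, num_tokens, num_tokens) self-attention matrix;
        # queries and keys reuse the raw embeddings here purely for illustration.
        d_k = tokens.size(-1)
        scores = torch.matmul(tokens, tokens.transpose(-2, -1)) / (d_k ** 0.5)
        return F.softmax(scores, dim=-1)

    def relationship_consistency_loss(text, audio, video):
        # Hypothetical consistency term: pull the audio and video self-attention
        # matrices toward the text one, so token-to-token relationships agree.
        # Assumes all three sequences share the same token count, as when the
        # audio/video features have been distilled into text-guided tokens.
        a_text = self_attention_matrix(text).detach()  # text acts as the anchor
        a_audio = self_attention_matrix(audio)
        a_video = self_attention_matrix(video)
        return F.mse_loss(a_audio, a_text) + F.mse_loss(a_video, a_text)

    # Example usage with random features (batch of 2, 8 tokens, 64 dimensions):
    text = torch.randn(2, 8, 64)
    audio = torch.randn(2, 8, 64)
    video = torch.randn(2, 8, 64)
    loss = relationship_consistency_loss(text, audio, video)

Minimizing such a term during training would encourage the non-text modalities to reproduce the pairwise token relationships observed in the text modality, which is one plausible reading of "controlling consistency" of the per-modality attention matrices.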

Supplemental Material

MP4 File: Presentation video


Cited By

  • (2024) Web Semantic-Enhanced Multimodal Sentiment Analysis Using Multilayer Cross-Attention Fusion. International Journal on Semantic Web & Information Systems 20(1), 1-29. DOI: 10.4018/IJSWIS.360653. Online publication date: 13-Dec-2024.
  • (2024) ERL-MR: Harnessing the Power of Euler Feature Representations for Balanced Multi-modal Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4591-4600. DOI: 10.1145/3664647.3681215. Online publication date: 28-Oct-2024.
  • (2024) Progressive Multimodal Pivot Learning: Towards Semantic Discordance Understanding as Humans. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 591-601. DOI: 10.1145/3627673.3679524. Online publication date: 21-Oct-2024.
  • (2024) Uncertainty-Debiased Multimodal Fusion: Learning Deterministic Joint Representation for Multimodal Sentiment Analysis. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688376. Online publication date: 15-Jul-2024.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. multimodal fusion
    2. multimodal sentiment analysis
    3. representation interactive learning

    Qualifiers

    • Research-article

    Funding Sources

    • the Science and Technology Innovation Committee of Shenzhen Municipality Foundation

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


