DOI: 10.1145/3581783.3611974

AcFormer: An Aligned and Compact Transformer for Multimodal Sentiment Analysis

Published: 27 October 2023

Abstract

Multimodal Sentiment Analysis (MSA) is a popular research topic that aims to exploit multimodal signals to understand human emotions. The primary approach to this task is to develop complex fusion techniques; however, the heterogeneity of the modalities and their unaligned nature pose significant challenges to fusion. In addition, existing methods give little consideration to the efficiency of multimodal fusion. To tackle these issues, we propose AcFormer, which contains two core ingredients: i) contrastive learning within and across modalities to explicitly align the different modality streams before fusion; and ii) pivot attention for multimodal interaction and fusion. The former encourages positive image-audio-text triplets to have similar representations, in contrast to negative ones. The latter introduces attention pivots that serve as cross-modal information bridges and restrict cross-modal attention to a limited number of fusion pivot tokens. We evaluate AcFormer on multiple MSA tasks, including multimodal emotion recognition, humor detection, and sarcasm detection. Empirical results show that AcFormer achieves the best performance with minimal computation cost compared with previous state-of-the-art methods. Our code is publicly available at https://github.com/dingchaoyue/AcFormer.
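
To make the two ingredients concrete, the sketch below illustrates them in PyTorch: a symmetric InfoNCE loss that pulls matching image-audio-text triplets together before fusion, and a fusion layer in which a small set of learnable pivot tokens mediates all cross-modal attention. This is a minimal illustration of the ideas described in the abstract, not the authors' released code; the names (pairwise_infonce, PivotAttentionFusion), the mean pooling, and the hyper-parameters (temperature, num_pivots, dimensions) are assumptions made here for clarity. Refer to the linked repository for the actual implementation.

```python
# Illustrative sketch only (PyTorch). Module names, shapes, pooling, and
# hyper-parameters are assumptions for exposition, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pairwise_infonce(x, y, temperature=0.07):
    """Symmetric InfoNCE between two batches of pooled modality embeddings.

    Matching clips in the batch act as positives; all other pairings as negatives.
    """
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                     # (B, B) similarities
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def triplet_alignment_loss(v, a, t, temperature=0.07):
    """Align image-audio-text triplets via the three pairwise contrastive losses."""
    return (pairwise_infonce(v, t, temperature) +
            pairwise_infonce(a, t, temperature) +
            pairwise_infonce(v, a, temperature))


class PivotAttentionFusion(nn.Module):
    """Cross-modal fusion routed through a small set of shared pivot tokens.

    Each modality exchanges information only with `num_pivots` learnable tokens
    instead of attending to every token of every other modality.
    """

    def __init__(self, dim=256, num_pivots=4, num_heads=8):
        super().__init__()
        self.pivots = nn.Parameter(0.02 * torch.randn(1, num_pivots, dim))
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, streams):
        """streams: list of (B, L_m, dim) token sequences, one per modality."""
        pivots = self.pivots.expand(streams[0].size(0), -1, -1)
        # 1) Pivots gather information from all modality tokens.
        context = torch.cat(streams, dim=1)
        pivots, _ = self.collect(pivots, context, context)
        # 2) Each modality reads the fused information back from the pivots.
        return [x + self.distribute(x, pivots, pivots)[0] for x in streams]


# Toy usage with random unimodal features (batch of 8 clips, hidden size 256).
v, a, t = torch.randn(8, 50, 256), torch.randn(8, 400, 256), torch.randn(8, 30, 256)
align_loss = triplet_alignment_loss(v.mean(dim=1), a.mean(dim=1), t.mean(dim=1))
fused_v, fused_a, fused_t = PivotAttentionFusion()([v, a, t])
```

Because each modality exchanges information only with the few pivot tokens rather than attending to every token of every other modality, cross-modal attention cost grows roughly linearly with sequence length instead of quadratically, which is consistent with the efficiency claim in the abstract.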


Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. attention mechanism
  2. contrastive learning
  3. multimodal fusion
  4. multimodal representation learning
  5. multimodal sentiment analysis

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Article Metrics

  • Downloads (last 12 months): 480
  • Downloads (last 6 weeks): 32

Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) TF-BERT: Tensor-based fusion BERT for multimodal sentiment analysis. Neural Networks, Vol. 185, 107222. https://doi.org/10.1016/j.neunet.2025.107222
  • (2025) Learning fine-grained representation with token-level alignment for multimodal sentiment analysis. Expert Systems with Applications, Vol. 269, 126274. https://doi.org/10.1016/j.eswa.2024.126274
  • (2024) SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 78-87. https://doi.org/10.1145/3689092.3689404
  • (2024) GRACE: GRadient-based Active Learning with Curriculum Enhancement for Multimodal Sentiment Analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, 5702-5711. https://doi.org/10.1145/3664647.3681617
  • (2024) GLoMo: Global-Local Modal Fusion for Multimodal Sentiment Analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, 1800-1809. https://doi.org/10.1145/3664647.3681527
  • (2024) ERL-MR: Harnessing the Power of Euler Feature Representations for Balanced Multi-modal Learning. In Proceedings of the 32nd ACM International Conference on Multimedia, 4591-4600. https://doi.org/10.1145/3664647.3681215
