ABSTRACT
Multi-modal sentiment analysis (MSA) has attracted growing interest in both academia and industry. Conventional approaches typically require massive labeled data to train deep neural models. To alleviate this issue, we study few-shot MSA, which relies on only a small number of labeled samples. Inspired by the success of textual prompt-based fine-tuning (PF) in few-shot scenarios, we introduce a multi-modal prompt-based fine-tuning (MPF) approach. To narrow the semantic gap between language and vision, we propose Unified Pre-training for Multi-modal Prompt-based Fine-tuning (UP-MPF), which consists of two stages. First, in the unified pre-training stage, we employ a simple and effective task to obtain coherent vision-language representations from frozen pre-trained language models (PLMs): predicting the rotation direction of an input image, with a prompt phrase supplied as concurrent input. Second, in multi-modal prompt-based fine-tuning, we freeze the visual encoder to further reduce the number of tunable parameters, which additionally benefits few-shot MSA. Extensive experiments and analysis on three coarse-grained and three fine-grained MSA datasets demonstrate that UP-MPF outperforms state-of-the-art PF, MSA, and multi-modal pre-training approaches.
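The two stages described above can be sketched in a few lines. This is a minimal illustrative sketch only: the rotation-direction words, the `<img>` placeholder token, and the cloze template and verbalizer below are assumptions for illustration, not the exact ones used in UP-MPF.

```python
import random

# Stage 1 (unified pre-training): self-supervised rotation prediction.
# Each unlabeled image yields a (rotated_image, direction_word) pair at
# no annotation cost; the PLM is asked to predict the direction word.
ROTATION_WORDS = {0: "original", 1: "right", 2: "upside-down", 3: "left"}

def rotate90(grid):
    """Rotate a 2-D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_rotation_example(grid, rng=random):
    """Pick a random rotation; return (rotated image, target word)."""
    k = rng.randrange(4)
    for _ in range(k):
        grid = rotate90(grid)
    return grid, ROTATION_WORDS[k]

# Stage 2 (multi-modal prompt-based fine-tuning): visual features fill
# placeholder slots and the frozen-encoder pipeline lets the PLM predict
# a sentiment word at the <mask> position.
VERBALIZER = {"positive": "great", "neutral": "okay", "negative": "terrible"}

def build_prompt(text, num_image_tokens=4):
    """Cloze-style template; <img> marks slots for projected visual features."""
    slots = " ".join(["<img>"] * num_image_tokens)
    return f"{slots} {text} It was <mask>."
```

Because both stages share the same cloze format, the representations learned while predicting rotation words transfer directly to predicting sentiment words in the few-shot stage.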
Index Terms
- Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning