ABSTRACT
Multi-modal sentiment analysis (MSA) has attracted growing interest in both academia and industry. Conventional approaches typically require massive labeled data to train deep neural models. To alleviate this issue, we study few-shot MSA, which relies on only a small number of labeled samples. Inspired by the success of textual prompt-based fine-tuning (PF) in few-shot scenarios, we introduce a multi-modal prompt-based fine-tuning (MPF) approach. To narrow the semantic gap between language and vision, we propose Unified Pre-training for Multi-modal Prompt-based Fine-tuning (UP-MPF), which consists of two stages. First, in the unified pre-training stage, we employ a simple and effective task to obtain coherent vision-language representations from frozen pre-trained language models (PLMs): predicting the rotation direction of an input image, with a prompt phrase supplied as concurrent input. Second, in multi-modal prompt-based fine-tuning, we freeze the visual encoder to further reduce the number of tunable parameters, which additionally benefits few-shot MSA. Extensive experiments and analysis on three coarse-grained and three fine-grained MSA datasets demonstrate that UP-MPF outperforms state-of-the-art PF, MSA, and multi-modal pre-training approaches.
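The two stages described above can be sketched in a few lines. This is a minimal illustrative sketch only: the rotation-direction words, the `<img>` placeholder token, and the cloze template and verbalizer below are assumptions for illustration, not the exact ones used in UP-MPF.

```python
import random

# Stage 1 (unified pre-training): self-supervised rotation prediction.
# Each unlabeled image yields a (rotated_image, direction_word) pair at
# no annotation cost; the PLM is asked to predict the direction word.
ROTATION_WORDS = {0: "original", 1: "right", 2: "upside-down", 3: "left"}

def rotate90(grid):
    """Rotate a 2-D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_rotation_example(grid, rng=random):
    """Pick a random rotation; return (rotated image, target word)."""
    k = rng.randrange(4)
    for _ in range(k):
        grid = rotate90(grid)
    return grid, ROTATION_WORDS[k]

# Stage 2 (multi-modal prompt-based fine-tuning): visual features fill
# placeholder slots and the frozen-encoder pipeline lets the PLM predict
# a sentiment word at the <mask> position.
VERBALIZER = {"positive": "great", "neutral": "okay", "negative": "terrible"}

def build_prompt(text, num_image_tokens=4):
    """Cloze-style template; <img> marks slots for projected visual features."""
    slots = " ".join(["<img>"] * num_image_tokens)
    return f"{slots} {text} It was <mask>."
```

Because both stages share the same cloze format, the representations learned while predicting rotation words transfer directly to predicting sentiment words in the few-shot stage.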
Index Terms
- Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning