DOI: 10.1145/3503161.3548306

Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning

Published: 10 October 2022

ABSTRACT

Multi-modal sentiment analysis (MSA) has attracted growing interest in both academia and industry. Conventional studies typically require massive labeled data to train deep neural models. To alleviate this issue, in this paper we study few-shot MSA with only a small number of labeled samples. Inspired by the success of textual prompt-based fine-tuning (PF) in few-shot scenarios, we introduce a multi-modal prompt-based fine-tuning (MPF) approach. To narrow the semantic gap between language and vision, we propose Unified Pre-training for Multi-modal Prompt-based Fine-tuning (UP-MPF), which consists of two stages. First, in the unified pre-training stage, we employ a simple and effective task to obtain coherent vision-language representations from a fixed pre-trained language model (PLM): predicting the rotation direction of an input image while a prompt phrase is fed to the PLM as concurrent input. Second, in the multi-modal prompt-based fine-tuning stage, we freeze the visual encoder to further reduce the number of trainable parameters, which benefits few-shot MSA. Extensive experiments and analysis on three coarse-grained and three fine-grained MSA datasets demonstrate that UP-MPF outperforms state-of-the-art PF, MSA, and multi-modal pre-training approaches.
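To make the unified pre-training stage concrete, the sketch below illustrates one plausible reading of the rotation-prediction task described in the abstract: an image feature is prepended to a prompt phrase containing a [MASK] slot, the frozen PLM fills the mask, and only the visual encoder and a small projection layer are trained. This is a minimal illustration, not the authors' released implementation; the prompt wording, the verbalizer words, and all module and variable names are assumptions made for exposition.

```python
# Illustrative sketch of rotation-direction pre-training with a frozen PLM.
# Prompt text, verbalizer words, and module names are assumed, not taken from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
plm = BertForMaskedLM.from_pretrained("bert-base-uncased")
for p in plm.parameters():              # the PLM stays fixed during pre-training
    p.requires_grad = False

visual_encoder = resnet50(weights=None)
visual_encoder.fc = nn.Identity()       # 2048-d global image feature
proj = nn.Linear(2048, plm.config.hidden_size)  # project the image feature into the PLM embedding space

# Hypothetical prompt and verbalizer mapping rotation angles to single words.
prompt = "The image is rotated towards the [MASK] ."
verbalizer = {0: "top", 90: "right", 180: "bottom", 270: "left"}
label_ids = tokenizer.convert_tokens_to_ids(list(verbalizer.values()))

def rotation_logits(image: torch.Tensor) -> torch.Tensor:
    """Return logits over the four direction words for a batch of images."""
    img_feat = proj(visual_encoder(image)).unsqueeze(1)        # [B, 1, H] visual token
    enc = tokenizer(prompt, return_tensors="pt")
    txt_emb = plm.get_input_embeddings()(enc["input_ids"])     # [1, T, H] prompt embeddings
    txt_emb = txt_emb.expand(image.size(0), -1, -1)
    inputs_embeds = torch.cat([img_feat, txt_emb], dim=1)      # prepend the visual token
    out = plm(inputs_embeds=inputs_embeds)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0] + 1
    return out.logits[:, mask_pos, label_ids]                  # scores for top/right/bottom/left

# Training would rotate each image by a random multiple of 90 degrees and minimize
# cross-entropy between these logits and the rotation label, updating only the
# visual encoder and the projection layer while the PLM remains frozen.
```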


Published in

          MM '22: Proceedings of the 30th ACM International Conference on Multimedia
          October 2022
          7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

          Copyright © 2022 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 10 October 2022


          Qualifiers

          • research-article

          Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%

