DOI: 10.1145/3664647.3681617

GRACE: GRadient-based Active Learning with Curriculum Enhancement for Multimodal Sentiment Analysis

Published: 28 October 2024

Abstract

Multimodal sentiment analysis (MSA) aims to predict sentiment from the text, audio, and visual streams of videos. Existing works focus on designing fusion strategies or decoupling mechanisms, but they suffer from low data utilization and rely heavily on large amounts of labeled data, and acquiring large-scale annotations for MSA is extremely labor-intensive and costly. To address this challenge, we propose GRACE, a GRadient-based Active learning method with Curriculum Enhancement, designed for MSA under a multi-task learning framework. Our approach reduces annotation cost by strategically selecting valuable samples from the unlabeled data pool while maintaining high performance. Specifically, we introduce informativeness and representativeness criteria, calculated from gradient magnitudes and sample distances, to quantify the active value of unlabeled samples. An easiness criterion is also incorporated to avoid outliers, exploiting the relationship between modality consistency and sample difficulty. During learning, we dynamically balance sample difficulty and active value, guided by the curriculum learning principle: easier, modality-aligned samples are prioritized for stable initial training, and more challenging samples with modality conflicts are gradually incorporated as training progresses. Extensive experiments demonstrate the effectiveness of our approach on both multimodal sentiment regression and classification benchmarks.
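To make the three criteria concrete, below is a minimal PyTorch sketch of how such an acquisition score could be assembled. It is an illustration under stated assumptions, not the authors' released implementation: the pseudo-label trick for the gradient norm, the `head` module name, the nearest-neighbor distance as representativeness, the prediction-variance proxy for modality consistency, and the linear curriculum schedule are all hypothetical choices made for this sketch.

```python
# Illustrative sketch of a GRACE-style acquisition score (assumptions:
# a classification `head` module, pseudo-labels from the model's own
# predictions, and a linear curriculum pacing schedule).
import torch
import torch.nn.functional as F


def gradient_informativeness(model, x):
    """Informativeness: norm of the loss gradient w.r.t. the final layer,
    using the model's own prediction as a pseudo-label (assumption)."""
    logits = model(x)
    pseudo = logits.argmax(dim=-1).detach()
    loss = F.cross_entropy(logits, pseudo)
    grads = torch.autograd.grad(loss, list(model.head.parameters()))
    return torch.cat([g.flatten() for g in grads]).norm().item()


def representativeness(candidate_emb, labeled_embs):
    """Representativeness: distance to the nearest already-labeled sample
    in embedding space; larger means the sample covers an unexplored region."""
    return torch.cdist(candidate_emb.unsqueeze(0), labeled_embs).min().item()


def easiness(unimodal_preds):
    """Easiness: agreement among per-modality predictions; low variance
    across modalities marks a modality-consistent, hence easier, sample."""
    stacked = torch.stack(unimodal_preds)      # (num_modalities, num_classes)
    return -stacked.var(dim=0).mean().item()   # higher = more consistent


def grace_score(info, rep, easy, epoch, total_epochs, alpha=1.0, beta=1.0):
    """Curriculum-weighted score: the easiness weight decays linearly so that
    easy, modality-aligned samples dominate early selection while harder,
    modality-conflicting samples are admitted later."""
    lam = 1.0 - epoch / total_epochs           # pacing schedule (assumption)
    return alpha * info + beta * rep + lam * easy
```

In a full active-learning loop, the three raw scores would typically be normalized over the unlabeled pool before being combined and ranked, with the top-scoring batch sent for annotation each round.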

Supplemental Material

MP4 File - 5069-video.mp4
Video Presentation about Tackling Data Utilization in Multimodal Sentiment Analysis with GRACE



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. active learning
2. curriculum learning
3. multimodal sentiment analysis

Qualifiers

• Research-article

Funding Sources

• National Natural Science Foundation of China (NSFC)
• National Key R&D Program of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
