DOI: 10.1145/3637528.3671690

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Published: 24 August 2024

Abstract

Large-scale vision-language (V-L) models have demonstrated remarkable generalization capabilities for downstream tasks through prompt tuning. However, the mechanisms behind the learned text representations are unknown, which limits further generalization gains, and the limitation is more severe under the class imbalances that are prevalent in web-sourced datasets. Recent advances on the neural collapse (NC) phenomenon in vision-only models suggest that the optimal representation structure is the simplex equiangular tight frame (ETF), which paves the way for studying representations in V-L models. In this paper, we make the first attempt to use NC to examine the representations in V-L models via prompt tuning. We find that the NC optimality of text-to-image representations is positively correlated with downstream generalizability, and this effect is more pronounced under class-imbalanced settings. To improve the representations, we propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts so that text and image representations satisfy the same simplex ETF. NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and is compatible with other prompt tuning methods. Extensive experiments show that NPT consistently improves existing prompt tuning techniques across 11 datasets in both balanced and imbalanced settings.
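
A minimal, illustrative sketch (not the authors' code) of the abstract's central geometric object may help. The snippet below, with our own hypothetical helper names `simplex_etf` and `etf_deviation`, constructs a K-class simplex ETF and measures how far a set of class-mean features deviates from that ideal geometry, one common way to quantify NC optimality.

```python
import torch

def simplex_etf(num_classes: int, dim: int) -> torch.Tensor:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF:
    unit-norm vectors with pairwise cosine similarity -1/(num_classes - 1)."""
    assert dim >= num_classes, "need dim >= num_classes for a partial orthogonal U"
    # Random partial orthogonal matrix U (dim x K) with U^T U = I_K.
    u, _ = torch.linalg.qr(torch.randn(dim, num_classes))
    k = num_classes
    center = torch.eye(k) - torch.ones(k, k) / k        # I_K - (1/K) 1 1^T
    return (k / (k - 1)) ** 0.5 * (u @ center)          # sqrt(K/(K-1)) U (I - 11^T/K)

def etf_deviation(class_means: torch.Tensor) -> torch.Tensor:
    """Frobenius distance between the Gram matrix of normalized class means
    (shape (K, d)) and the ideal simplex-ETF Gram matrix."""
    k = class_means.shape[0]
    feats = torch.nn.functional.normalize(class_means, dim=-1)
    gram = feats @ feats.T                              # observed cosine similarities
    ideal = k / (k - 1) * torch.eye(k) - 1.0 / (k - 1)  # 1 on diag, -1/(K-1) off
    return (gram - ideal).norm()

etf = simplex_etf(num_classes=10, dim=512)
print(etf_deviation(etf.T))  # ~0: a perfect ETF matches the ideal Gram matrix
```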

Supplemental Material

MP4 File - Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models
Neural-collapse-anchored Prompt Tuning capitalizes on the benefits of two distinct regularizers: the LC regularizer, which fosters more discriminative textual representations, and the MI regularizer, which promotes multi-modal alignment to address imbalance challenges in CLIP.
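
The paper's exact loss formulations are not reproduced on this page, so the following is only a hypothetical sketch of what the two regularizers could look like under the ETF-anchoring idea described above: the LC term pulls per-class text features toward fixed ETF vertices, and the MI term pulls image features toward the vertex of their class. The names `lc_regularizer`, `mi_regularizer`, `lambda_lc`, and `lambda_mi` are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lc_regularizer(text_feats: torch.Tensor, etf: torch.Tensor) -> torch.Tensor:
    """Language-modality collapse (illustrative): push the text feature of
    each class toward its fixed ETF vertex via cosine distance."""
    t = F.normalize(text_feats, dim=-1)           # (K, d), one feature per class
    targets = F.normalize(etf.T, dim=-1)          # (K, d), ETF vertices
    return (1.0 - (t * targets).sum(dim=-1)).mean()

def mi_regularizer(img_feats: torch.Tensor, labels: torch.Tensor,
                   etf: torch.Tensor) -> torch.Tensor:
    """Multi-modality isomorphism (illustrative): align image features with
    the ETF vertices of their classes, so both modalities share one geometry."""
    v = F.normalize(img_feats, dim=-1)            # (N, d), image features
    targets = F.normalize(etf.T, dim=-1)[labels]  # (N, d), vertex per sample
    return (1.0 - (v * targets).sum(dim=-1)).mean()

# Illustrative usage with the simplex_etf helper from the previous sketch;
# lambda_lc and lambda_mi are assumed weights added to the usual
# prompt-tuning cross-entropy objective:
#   loss = ce + lambda_lc * lc_regularizer(text_feats, etf) \
#             + lambda_mi * mi_regularizer(img_feats, labels, etf)
```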


Cited By

  • (2025) Uncover the balanced geometry in long-tailed contrastive language-image pretraining. Machine Learning 114(4). DOI: 10.1007/s10994-025-06745-w. Online publication date: 24-Feb-2025


Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN: 9798400704901
DOI: 10.1145/3637528
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. prompt tuning
  2. representation learning
  3. vision-language models

Qualifiers

  • Research-article

Conference

KDD '24

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 274
  • Downloads (last 6 weeks): 38

Reflects downloads up to 01 Mar 2025

