DOI: 10.1145/3637528.3671690

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Published: 24 August 2024

Abstract

Large-scale vision-language (V-L) models have demonstrated remarkable generalization capabilities for downstream tasks through prompt tuning. However, the mechanisms behind the learned text representations are unknown, which limits further generalization gains, and the limitation is more severe under the class imbalances that are prevalent in web-sourced datasets. Recent advances on the neural collapse (NC) phenomenon in vision-only models suggest that the optimal representation structure is the simplex equiangular tight frame (ETF), which paves the way for studying representations in V-L models. In this paper, we make the first attempt to use NC to examine the representations in V-L models via prompt tuning. We find that the NC optimality of text-to-image representations is positively correlated with downstream generalizability, and this effect is more pronounced under class-imbalanced settings. To improve the representations, we propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts so that text and image representations satisfy the same simplex ETF. NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and is compatible with other prompt tuning methods. Extensive experiments show that NPT consistently improves existing prompt tuning techniques across 11 datasets in both balanced and imbalanced settings.
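
A minimal, illustrative sketch (not the authors' code) of the abstract's central geometric object may help. The snippet below, with our own hypothetical helper names `simplex_etf` and `etf_deviation`, constructs a K-class simplex ETF and measures how far a set of class-mean features deviates from that ideal geometry, one common way to quantify NC optimality.

```python
import torch

def simplex_etf(num_classes: int, dim: int) -> torch.Tensor:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF:
    unit-norm vectors with pairwise cosine similarity -1/(num_classes - 1)."""
    assert dim >= num_classes, "need dim >= num_classes for a partial orthogonal U"
    # Random partial orthogonal matrix U (dim x K) with U^T U = I_K.
    u, _ = torch.linalg.qr(torch.randn(dim, num_classes))
    k = num_classes
    center = torch.eye(k) - torch.ones(k, k) / k        # I_K - (1/K) 1 1^T
    return (k / (k - 1)) ** 0.5 * (u @ center)          # sqrt(K/(K-1)) U (I - 11^T/K)

def etf_deviation(class_means: torch.Tensor) -> torch.Tensor:
    """Frobenius distance between the Gram matrix of normalized class means
    (shape (K, d)) and the ideal simplex-ETF Gram matrix."""
    k = class_means.shape[0]
    feats = torch.nn.functional.normalize(class_means, dim=-1)
    gram = feats @ feats.T                              # observed cosine similarities
    ideal = k / (k - 1) * torch.eye(k) - 1.0 / (k - 1)  # 1 on diag, -1/(K-1) off
    return (gram - ideal).norm()

etf = simplex_etf(num_classes=10, dim=512)
print(etf_deviation(etf.T))  # ~0: a perfect ETF matches the ideal Gram matrix
```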

Supplemental Material

MP4 File - Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models
Neural-collapse-anchored Prompt Tuning capitalizes on the benefits of two distinct regularizers: the LC regularizer, which fosters more discriminative textual representations, and the MI regularizer, which promotes multi-modal alignment to address imbalance challenges in CLIP.
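
The paper's exact loss formulations are not reproduced on this page, so the following is only a hypothetical sketch of what the two regularizers could look like under the ETF-anchoring idea described above: the LC term pulls per-class text features toward fixed ETF vertices, and the MI term pulls image features toward the vertex of their class. The names `lc_regularizer`, `mi_regularizer`, `lambda_lc`, and `lambda_mi` are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lc_regularizer(text_feats: torch.Tensor, etf: torch.Tensor) -> torch.Tensor:
    """Language-modality collapse (illustrative): push the text feature of
    each class toward its fixed ETF vertex via cosine distance."""
    t = F.normalize(text_feats, dim=-1)           # (K, d), one feature per class
    targets = F.normalize(etf.T, dim=-1)          # (K, d), ETF vertices
    return (1.0 - (t * targets).sum(dim=-1)).mean()

def mi_regularizer(img_feats: torch.Tensor, labels: torch.Tensor,
                   etf: torch.Tensor) -> torch.Tensor:
    """Multi-modality isomorphism (illustrative): align image features with
    the ETF vertices of their classes, so both modalities share one geometry."""
    v = F.normalize(img_feats, dim=-1)            # (N, d), image features
    targets = F.normalize(etf.T, dim=-1)[labels]  # (N, d), vertex per sample
    return (1.0 - (v * targets).sum(dim=-1)).mean()

# Illustrative usage with the simplex_etf helper from the previous sketch;
# lambda_lc and lambda_mi are assumed weights added to the usual
# prompt-tuning cross-entropy objective:
#   loss = ce + lambda_lc * lc_regularizer(text_feats, etf) \
#             + lambda_mi * mi_regularizer(img_feats, labels, etf)
```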


Cited By

  • (2025) Uncover the balanced geometry in long-tailed contrastive language-image pretraining. Machine Learning 114(4). DOI: 10.1007/s10994-025-06745-w. Online publication date: 24-Feb-2025


Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN: 9798400704901
DOI: 10.1145/3637528
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. prompt tuning
  2. representation learning
  3. vision-language models

Qualifiers

  • Research-article

Conference

KDD '24

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 274
  • Downloads (last 6 weeks): 38

Reflects downloads up to 01 Mar 2025

