
Cascade Prompt Learning for Vision-Language Model Adaptation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Prompt learning has emerged as an effective approach to enhance the performance of Vision-Language Models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first, boosting prompt is crafted to extract domain-general knowledge from a senior, larger CLIP teacher model by aligning the two models' predicted logits on extensive unlabeled domain images. The second, adapting prompt is then cascaded with the frozen first set and fine-tuned on downstream tasks, following the approaches employed in prior research. In this manner, CasPL captures domain-general and task-specific representations in explicitly separate, gradual groups of prompts, thus potentially alleviating overfitting in the target domain. It is worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL.
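The cascade described above amounts to two short training loops. Below is a minimal, illustrative PyTorch-style sketch of that two-phase procedure, not the authors' released implementation: every name (student, teacher, the prompts= keyword, the data loaders, the hyperparameters, prompt shapes) is a hypothetical placeholder, and the losses are the standard knowledge-distillation and cross-entropy objectives implied by the abstract.

import torch
import torch.nn.functional as F

def learn_boosting_prompt(student, teacher, unlabeled_loader, boost_prompt,
                          temperature=4.0, lr=2e-3, epochs=1):
    # Phase 1: distill domain-general knowledge from the frozen, larger CLIP
    # teacher into the boosting prompt by aligning the two models' logits
    # on unlabeled domain images.
    optimizer = torch.optim.SGD([boost_prompt], lr=lr)
    for _ in range(epochs):
        for images in unlabeled_loader:
            with torch.no_grad():
                teacher_logits = teacher(images)              # teacher zero-shot logits
            student_logits = student(images, prompts=[boost_prompt])
            loss = F.kl_div(                                  # soft-label KD objective
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean") * temperature ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return boost_prompt.detach()                              # frozen from now on

def learn_adapting_prompt(student, labeled_loader, frozen_boost_prompt, adapt_prompt,
                          lr=2e-3, epochs=1):
    # Phase 2: cascade a fresh adapting prompt behind the frozen boosting prompt
    # and fine-tune only the adapting prompt on the labeled downstream task,
    # as in ordinary prompt tuning.
    optimizer = torch.optim.SGD([adapt_prompt], lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader:
            logits = student(images, prompts=[frozen_boost_prompt, adapt_prompt])
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapt_prompt

# Example wiring (prompt length and width are arbitrary placeholders):
# boost_prompt = torch.nn.Parameter(torch.zeros(4, 512))
# adapt_prompt = torch.nn.Parameter(torch.zeros(4, 512))
# frozen = learn_boosting_prompt(small_clip, large_clip, unlabeled_images, boost_prompt)
# tuned  = learn_adapting_prompt(small_clip, few_shot_loader, frozen, adapt_prompt)

Freezing the boosting prompt in phase 2 is what keeps the domain-general and task-specific parameters in the two distinct prompt groups the abstract describes; it also makes the scheme plug-and-play, since any existing prompt-learning pipeline can simply prepend the frozen prompts.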

G. Wu and X. Zhang contributed equally. This work was done while Ge Wu was an intern at Megvii Technology.




Acknowledgements

This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 62206134), the Fundamental Research Funds for the Central Universities (070-63233084), and the Tianjin Key Laboratory of Visual Computing and Intelligent Perception (VCIP). Computation was supported by the Supercomputing Center of Nankai University (NKSC). This work was also supported by the National Science Fund of China under Grant No. 62361166670.

Author information


Corresponding author

Correspondence to Xiang Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 399 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, G. et al. (2025). Cascade Prompt Learning for Vision-Language Model Adaptation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_18


  • DOI: https://doi.org/10.1007/978-3-031-72973-7_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72972-0

  • Online ISBN: 978-3-031-72973-7

  • eBook Packages: Computer Science, Computer Science (R0)
