Abstract
Prompt learning has emerged as an effective approach to enhance the performance of Vision-Language Models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens are used primarily for a single phase of adapting to the task (i.e., the adapting prompt), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and task-specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first, boosting prompt is crafted to extract domain-general knowledge from a senior, larger CLIP teacher model by aligning their predicted logits on extensive unlabeled domain images. The second, adapting prompt is then cascaded with the frozen first set and fine-tuned on the downstream task, following the approaches employed in prior research. In this manner, CasPL can effectively capture domain-general and task-specific representations in explicitly separate, gradually learned groups of prompts, thus potentially alleviating overfitting in the target domain. Notably, CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL.
G. Wu and X. Zhang contributed equally. This work was done while Ge Wu was an intern at Megvii Technology.
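To make the two-phase training described in the abstract more concrete, below is a minimal, illustrative sketch of cascade prompt learning in PyTorch-style code. The callables teacher_logits(images) and student_logits(images, prompts), as well as the prompt lengths, optimizer, temperature, and epoch counts, are hypothetical placeholders standing in for frozen CLIP teacher/student encoders and for the paper's actual training configuration (prompt placement and hyperparameters may differ); only the overall two-phase structure follows the description above.

# Phase 1 (boosting): distill domain-general knowledge from a larger frozen CLIP
# teacher into a small set of learnable "boosting" prompt tokens by aligning the
# teacher's and student's predicted logits on unlabeled domain images.
# Phase 2 (adapting): freeze the boosting prompts, cascade a fresh set of
# "adapting" prompt tokens behind them, and fine-tune on labeled downstream data.
import torch
import torch.nn.functional as F


def learn_boosting_prompt(student_logits, teacher_logits, unlabeled_loader,
                          n_tokens=4, dim=512, tau=4.0, lr=2e-3, epochs=20):
    # Hypothetical interfaces: teacher_logits(images) -> [B, C] class logits,
    # student_logits(images, prompts) -> [B, C] class logits with prompts attached.
    boosting = torch.nn.Parameter(0.02 * torch.randn(n_tokens, dim))
    opt = torch.optim.SGD([boosting], lr=lr)
    for _ in range(epochs):
        for images in unlabeled_loader:            # no labels required here
            with torch.no_grad():
                t = teacher_logits(images)         # senior, larger CLIP teacher
            s = student_logits(images, boosting)   # student with boosting prompt
            # Standard temperature-scaled KL distillation loss on the logits.
            loss = F.kl_div(F.log_softmax(s / tau, dim=-1),
                            F.softmax(t / tau, dim=-1),
                            reduction="batchmean") * tau ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return boosting.detach()                       # kept frozen in phase 2


def learn_adapting_prompt(student_logits, boosting, labeled_loader,
                          n_tokens=4, dim=512, lr=2e-3, epochs=20):
    adapting = torch.nn.Parameter(0.02 * torch.randn(n_tokens, dim))
    opt = torch.optim.SGD([adapting], lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader:      # few-shot downstream data
            # Cascade: frozen boosting tokens followed by learnable adapting tokens.
            prompts = torch.cat([boosting, adapting], dim=0)
            loss = F.cross_entropy(student_logits(images, prompts), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return adapting.detach()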
References
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: CVPR (2014)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv (2018)
Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: CVPR (2022)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. ArXiv (2020)
Fang, Z., Wang, J., Hu, X., Wang, L., Yang, Y., Liu, Z.: Compressing visual-linguistic model via knowledge distillation. In: ICCV (2021)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: CVPR Workshops (2004)
Gao, P., et al.: CLIP-Adapter: better vision-language models with feature adapters. IJCV (2023)
Gao, T., Fisch, A., Chen, D.: Making pre-trained language models better few-shot learners. ArXiv (2020)
Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. ArXiv (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J-STARS 12(7), 2217–2226 (2019)
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: CVPR (2021)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. ArXiv (2015)
Huang, T., Chu, J., Wei, F.: Unsupervised prompt learning for vision-language models. ArXiv (2022)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
Jia, M., et al.: Visual prompt tuning. In: ECCV (2022)
Jiao, S., Wei, Y., Wang, Y., Zhao, Y., Shi, H.: Learning mask-aware CLIP representations for zero-shot segmentation. ArXiv (2023)
Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. ArXiv (2019)
Kahana, J., Cohen, N., Hoshen, Y.: Improving zero-shot models with label distribution priors. ArXiv (2022)
Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: MaPLe: multi-modal prompt learning. In: CVPR (2023)
Khattak, M.U., Wasim, S.T., Naseer, M., Khan, S., Yang, M.H., Khan, F.S.: Self-regulating prompts: foundational model adaptation without forgetting. In: ICCV (2023)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: CVPR Workshops (2013)
Laroudie, C., Bursuc, A., Ha, M.L., Franchi, G.: Improving CLIP robustness with knowledge distillation and self-training. ArXiv (2023)
Lee, D.H., et al.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop (2013)
Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. ArXiv (2021)
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. ArXiv (2021)
Li, Z., et al.: PromptKD: unsupervised prompt distillation for vision-language models. In: CVPR, pp. 26617–26626 (2024)
Li, Z., et al.: Curriculum temperature for knowledge distillation. In: AAAI, vol. 37, pp. 1504–1512 (2023)
Li, Z., Ye, J., Song, M., Huang, Y., Pan, Z.: Online knowledge distillation for efficient pose estimation. In: ICCV, pp. 11740–11750 (2021)
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55(9), 1–35 (2023)
Liu, X., et al.: P-tuning: prompt tuning can be comparable to fine-tuning across scales and tasks. In: ACL (2022)
Liu, Z., Hu, X., Nevatia, R.: Efficient feature distillation for zero-shot detection. ArXiv (2023)
Lu, Y., Liu, J., Zhang, Y., Liu, Y., Tian, X.: Prompt distribution learning. In: CVPR (2022)
Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: CVPR (2022)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. ArXiv (2013)
Menghini, C., Delworth, A., Bach, S.H.: Enhancing CLIP with CLIP: exploring pseudolabeling for limited-label prompt tuning. In: NeurIPS (2023)
Mirza, M.J., et al.: LaFTer: label-free tuning of zero-shot classifier using language and unlabeled image collections. In: NeurIPS (2023)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: ICML (2019)
Schick, T., Schütze, H.: Few-shot text generation with pattern-exploiting training. ArXiv (2020)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. ArXiv (2012)
Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. ArXiv (2019)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: NeurIPS (2019)
Wang, Z., et al.: Multimodal adaptive distillation for leveraging unimodal encoders for vision-language tasks. ArXiv (2022)
Wang, Z., et al.: CLIP-TD: CLIP targeted distillation for vision-language tasks. ArXiv (2022)
Wang, Z., et al.: DualPrompt: complementary prompting for rehearsal-free continual learning. In: ECCV (2022)
Wang, Z., et al.: Learning to prompt for continual learning. In: CVPR (2022)
Wu, K., et al.: TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In: ICCV (2023)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: CVPR (2010)
Yang, C., An, Z., Cai, L., Xu, Y.: Mutual contrastive learning for visual representation learning. In: AAAI, vol. 36, pp. 3045–3053 (2022)
Yang, C., et al.: CLIP-KD: an empirical study of distilling CLIP models. ArXiv (2023)
Yu, W., Liu, Y., Hua, W., Jiang, D., Ren, B., Bai, X.: Turning a CLIP model into a scene text detector. In: CVPR (2023)
Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Unified vision and language prompt learning. ArXiv (2022)
Zhang, J., Wu, S., Gao, L., Shen, H., Song, J.: DePT: decoupled prompt tuning. ArXiv (2023)
Zhang, R., et al.: Tip-Adapter: training-free CLIP-Adapter for better vision-language modeling. ArXiv (2021)
Zhang, W., Deng, W., Cui, Z., Liu, J., Jiao, L.: Object knowledge distillation for joint detection and tracking in satellite videos. IEEE Trans. Geosci. Remote Sens. 62, 1–13 (2024)
Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate before use: improving few-shot performance of language models. In: ICML (2021)
Zhong, Y., et al.: RegionCLIP: region-based language-image pretraining. In: CVPR (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)
Zhu, B., Niu, Y., Han, Y., Wu, Y., Zhang, H.: Prompt-aligned gradient for prompt tuning. In: ICCV (2023)
Acknowledgements
This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 62206134), the Fundamental Research Funds for the Central Universities (Grant No. 070-63233084), and the Tianjin Key Laboratory of Visual Computing and Intelligent Perception (VCIP). Computation was supported by the Supercomputing Center of Nankai University (NKSC). This work was also supported by the National Science Fund of China under Grant No. 62361166670.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wu, G. et al. (2025). Cascade Prompt Learning for Vision-Language Model Adaptation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_18