
Cascade Prompt Learning for Vision-Language Model Adaptation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Prompt learning has emerged as an effective approach to enhance the performance of Vision-Language Models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first, boosting prompt is crafted to extract domain-general knowledge from a senior, larger CLIP teacher model by aligning the two models' predicted logits on extensive unlabeled domain images. The second, adapting prompt is then cascaded with the frozen first set and fine-tuned on downstream tasks, following the approaches employed in prior research. In this manner, CasPL captures domain-general and task-specific representations in explicitly separate, gradual groups of prompts, thus potentially alleviating overfitting in the target domain. It is worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL.
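The cascade described above amounts to two short training loops. Below is a minimal, illustrative PyTorch-style sketch of that two-phase procedure, not the authors' released implementation: every name (student, teacher, the prompts= keyword, the data loaders, the hyperparameters, prompt shapes) is a hypothetical placeholder, and the losses are the standard knowledge-distillation and cross-entropy objectives implied by the abstract.

import torch
import torch.nn.functional as F

def learn_boosting_prompt(student, teacher, unlabeled_loader, boost_prompt,
                          temperature=4.0, lr=2e-3, epochs=1):
    # Phase 1: distill domain-general knowledge from the frozen, larger CLIP
    # teacher into the boosting prompt by aligning the two models' logits
    # on unlabeled domain images.
    optimizer = torch.optim.SGD([boost_prompt], lr=lr)
    for _ in range(epochs):
        for images in unlabeled_loader:
            with torch.no_grad():
                teacher_logits = teacher(images)              # teacher zero-shot logits
            student_logits = student(images, prompts=[boost_prompt])
            loss = F.kl_div(                                  # soft-label KD objective
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean") * temperature ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return boost_prompt.detach()                              # frozen from now on

def learn_adapting_prompt(student, labeled_loader, frozen_boost_prompt, adapt_prompt,
                          lr=2e-3, epochs=1):
    # Phase 2: cascade a fresh adapting prompt behind the frozen boosting prompt
    # and fine-tune only the adapting prompt on the labeled downstream task,
    # as in ordinary prompt tuning.
    optimizer = torch.optim.SGD([adapt_prompt], lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader:
            logits = student(images, prompts=[frozen_boost_prompt, adapt_prompt])
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapt_prompt

# Example wiring (prompt length and width are arbitrary placeholders):
# boost_prompt = torch.nn.Parameter(torch.zeros(4, 512))
# adapt_prompt = torch.nn.Parameter(torch.zeros(4, 512))
# frozen = learn_boosting_prompt(small_clip, large_clip, unlabeled_images, boost_prompt)
# tuned  = learn_adapting_prompt(small_clip, few_shot_loader, frozen, adapt_prompt)

Freezing the boosting prompt in phase 2 is what keeps the domain-general and task-specific parameters in the two distinct prompt groups the abstract describes; it also makes the scheme plug-and-play, since any existing prompt-learning pipeline can simply prepend the frozen prompts.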

G. Wu and X. Zhang contributed equally. This work was done while Ge Wu was an intern at Megvii Technology.




Acknowledgements

This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 62206134), the Fundamental Research Funds for the Central Universities (070-63233084), and the Tianjin Key Laboratory of Visual Computing and Intelligent Perception (VCIP). Computation was supported by the Supercomputing Center of Nankai University (NKSC). This work was also supported by the National Science Fund of China under Grant No. 62361166670.

Author information


Corresponding author

Correspondence to Xiang Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 399 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, G. et al. (2025). Cascade Prompt Learning for Vision-Language Model Adaptation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_18


  • DOI: https://doi.org/10.1007/978-3-031-72973-7_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72972-0

  • Online ISBN: 978-3-031-72973-7

  • eBook Packages: Computer Science, Computer Science (R0)
