Abstract
This paper addresses a significant limitation that prevents Contrastive Language-Image Pretrained Models (CLIP) from achieving optimal performance on downstream image classification tasks. The key problem with CLIP-style zero-shot classification is that it requires domain-specific context, in the form of prompts, to better align the class descriptions with the downstream data distribution. In particular, prompts for vision-language models are domain-level texts (e.g., “a centered satellite image of ...”) which, together with the class names, are fed into the text encoder to provide more context for the downstream dataset. These prompts are typically tuned by hand, which is time-consuming and often sub-optimal. To overcome this bottleneck, this paper proposes uCAP, a method that automatically learns domain-specific prompts/contexts using only unlabeled in-domain images. We achieve this by modeling the generation of images given the class names and a domain-specific prompt with an unsupervised likelihood distribution, and then performing inference of the prompts. We validate the proposed method across various models and datasets, showing that uCAP consistently outperforms manually tuned prompts and related baselines on the evaluated datasets: ImageNet, CIFAR-10, CIFAR-100, OxfordPets (up to 2%), SUN397 (up to 5%), and Caltech101 (up to 3%).
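To make the setup concrete, below is a minimal sketch of CLIP-style zero-shot classification with a domain prompt, plus a simple unsupervised selection loop over hand-written candidate templates. It assumes PyTorch and the OpenAI `clip` package; the confidence-based scoring rule and the helper names (`score_prompt`, `select_prompt`) are illustrative assumptions, a crude stand-in for uCAP's likelihood-based prompt inference rather than the authors' method.

```python
# Minimal sketch (NOT the authors' implementation) of unsupervised prompt
# selection for CLIP-style zero-shot classification. Assumes the OpenAI
# `clip` package and PyTorch. uCAP's likelihood-based inference is replaced
# here by a simple confidence score computed on unlabeled in-domain images.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def class_text_features(prompt, class_names):
    """Encode "<prompt> <class name>" for every class and L2-normalize."""
    texts = [f"{prompt} {name}" for name in class_names]
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def zero_shot_logits(image_feats, text_feats):
    """Cosine-similarity logits between images and class descriptions."""
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    return 100.0 * image_feats @ text_feats.t()

def score_prompt(prompt, class_names, unlabeled_images):
    """Average top-class probability on unlabeled images: a crude,
    confidence-based proxy for how well the prompt fits the domain."""
    text_feats = class_text_features(prompt, class_names)
    with torch.no_grad():
        image_feats = model.encode_image(unlabeled_images.to(device))
    probs = zero_shot_logits(image_feats, text_feats).softmax(dim=-1)
    return probs.max(dim=-1).values.mean().item()

def select_prompt(candidates, class_names, unlabeled_images):
    """Pick the candidate domain prompt with the best unsupervised score."""
    scores = {p: score_prompt(p, class_names, unlabeled_images)
              for p in candidates}
    return max(scores, key=scores.get), scores
```

For instance, calling `select_prompt(["a photo of a", "a centered satellite image of a"], class_names, images)` on a small batch of preprocessed, unlabeled in-domain images would return whichever template makes CLIP most confident on that domain under this proxy score; uCAP instead learns the prompt by inferring it under its unsupervised likelihood model.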
A. T. Nguyen—Work done while the author was at the University of Oxford and interning at Meta.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Nguyen, A.T. et al. (2025). uCAP: An Unsupervised Prompting Method for Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_25
DOI: https://doi.org/10.1007/978-3-031-72904-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1