Online Zero-Shot Classification with CLIP

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15135)

Abstract

Vision-language pre-training such as CLIP enables zero-shot transfer that classifies images according to a set of candidate class names. While CLIP demonstrates impressive zero-shot performance on diverse downstream tasks, it does not sufficiently leverage the distribution of the target data. In this work, we study a novel online zero-shot transfer scenario, where images arrive in a random order and each image is visited only once to obtain its prediction immediately, without storing its representation. Compared with vanilla zero-shot classification, the proposed framework preserves the flexibility required for online serving while exploiting the statistics of previously arrived images as side information to capture the target data distribution, which helps improve performance in real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predictions from online label learning and online proxy learning, our online zero-shot transfer method (OnZeta) achieves \(78.94\%\) accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show an improvement of more than \(3\%\) on average, which demonstrates the effectiveness of our proposal.
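
To make the online setting concrete, below is a minimal, hypothetical sketch of such a streaming loop in Python. It is not the paper's OnZeta algorithm: the function name, the moving-average updates, the mixing weight alpha, the step size lr, and the temperature are all illustrative assumptions. The sketch only mirrors the structure the abstract describes: each image is visited exactly once, a running estimate of the label distribution (standing in for online label learning) and per-class vision-space proxies (standing in for online proxy learning) are updated on the fly, and no image representation is stored.

    # Hypothetical sketch of the online zero-shot loop described above.
    # The concrete update rules (moving-average estimates, fixed mixing
    # weight alpha, temperature 0.01) are illustrative assumptions and
    # NOT the paper's actual OnZeta algorithm.
    import numpy as np

    def online_zero_shot(image_stream, text_features, alpha=0.5, lr=0.05):
        """image_stream: iterable of L2-normalized image embeddings, shape (d,).
        text_features: L2-normalized text embeddings, one row per class, (c, d)."""
        c = text_features.shape[0]
        prior = np.full(c, 1.0 / c)      # running estimate of the label distribution
        proxies = text_features.copy()   # vision-space class proxies, seeded by text
        predictions = []
        for x in image_stream:           # each image is visited exactly once
            # blend text-based zero-shot scores with proxy-based scores
            scores = alpha * (text_features @ x) + (1.0 - alpha) * (proxies @ x)
            logits = scores / 0.01                 # CLIP-style temperature
            p = np.exp(logits - logits.max())      # numerically stable softmax
            p = p * prior                          # reweight by the estimated prior
            p /= p.sum()
            y = int(np.argmax(p))
            predictions.append(y)                  # predict now; store no representation
            # online updates with step size lr
            prior = (1.0 - lr) * prior + lr * p
            proxies[y] = (1.0 - lr) * proxies[y] + lr * x
            proxies[y] /= np.linalg.norm(proxies[y])
        return predictions

With alpha set to 1 and the prior left uniform, the loop reduces to vanilla CLIP zero-shot classification, which is a convenient sanity check; the actual method instead derives its label and proxy updates from online optimization with guaranteed convergence.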


Author information

Corresponding author

Correspondence to Qi Qian.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 284 KB)

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Qian, Q., Hu, J. (2024). Online Zero-Shot Classification with CLIP. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15135. Springer, Cham. https://doi.org/10.1007/978-3-031-72980-5_27

  • DOI: https://doi.org/10.1007/978-3-031-72980-5_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72979-9

  • Online ISBN: 978-3-031-72980-5

  • eBook Packages: Computer Science, Computer Science (R0)
