
CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Few-shot image classification aims to generalize to unseen categories from only a few training samples. Transfer learning is a prominent approach to this task: a backbone is first learned on the base classes, and a classifier is then trained on the new classes using the previously learned knowledge. Typically, a convolutional neural network (CNN) is the preferred backbone. However, when samples are limited, the representation ability of CNN features degrades, which in turn degrades few-shot classification performance. Recently, large-scale pre-trained vision-language models such as CLIP have shown non-trivial potential, serving as backbones for zero- and few-shot transfer to a range of downstream tasks via prompting. To fully exploit the few-shot classification ability of vision-language models, we propose CoCoOpter, a novel “pre-training + prompt tuning + fine-tuning” paradigm built on CLIP. CoCoOpter alleviates overfitting and preserves generalization to unseen categories. Specifically, it learns an input-conditional neural network that shifts attention away from any specific category toward each individual input sample, which relieves overfitting. To connect the visual and textual signals, the previously learned visual representations are injected into the middle of the pre-trained CLIP to perform automatic prompt tuning, yielding input-specific prompt vectors. Moreover, two learnable lightweight neural networks are attached to the end of CLIP to guide information propagation between classes by fine-tuning both the visual and textual features. We conduct extensive experiments on 11 image classification datasets. The results show that CoCoOpter generalizes better to unseen classes and achieves superior few-shot classification performance with a straightforward design.
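As a rough illustration of this design, the PyTorch-style sketch below combines an input-conditional meta-network that produces a per-image bias for the learnable prompt context with two lightweight residual adapters applied to the frozen CLIP image and text features before cosine-similarity classification. This is a minimal sketch under stated assumptions, not the authors' implementation; the names and hyperparameters (MetaNet, Adapter, few_shot_logits, reduction, alpha, logit_scale, 512-dimensional features) are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaNet(nn.Module):
    """Lightweight network mapping an image feature to a bias vector that is
    added to the learnable prompt context tokens (input-conditional prompts)."""

    def __init__(self, vis_dim: int, ctx_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 16, ctx_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vis_dim) -> prompt bias: (batch, ctx_dim)
        return self.net(image_features)


class Adapter(nn.Module):
    """Residual bottleneck adapter fine-tuned on top of frozen CLIP features."""

    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen feature.
        return self.alpha * self.fc(x) + (1.0 - self.alpha) * x


def few_shot_logits(image_features, text_features, visual_adapter, text_adapter,
                    logit_scale: float = 100.0):
    """Cosine-similarity classification over adapted visual and textual features."""
    img = F.normalize(visual_adapter(image_features), dim=-1)  # (batch, dim)
    txt = F.normalize(text_adapter(text_features), dim=-1)     # (classes, dim)
    return logit_scale * img @ txt.t()                         # (batch, classes)


if __name__ == "__main__":
    # Toy shapes: 512-d CLIP features, 5 classes, batch of 4 images.
    img_feat = torch.randn(4, 512)
    txt_feat = torch.randn(5, 512)
    prompt_bias = MetaNet(vis_dim=512, ctx_dim=512)(img_feat)   # (4, 512)
    logits = few_shot_logits(img_feat, txt_feat, Adapter(512), Adapter(512))
    print(prompt_bias.shape, logits.shape)
```

In a full pipeline, the meta-network output would be added to each learnable context token before the CLIP text encoder, in the spirit of the conditional prompt-learning line of work the paper builds on, while the two adapters fine-tune the final visual and textual features.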


Data availability

All data included in this study are available upon request from the corresponding author.


Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 61806218), the National Key Research and Development Program of China (No. 2021YFB3100800), and the Ministry of Science and Technology of China (No. 2020AAA0108800).

Author information


Contributions

JY and YX wrote the main manuscript. YG and YW revised the manuscript. XZ and XL provided meaningful suggestions on the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yingmei Wei.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Compliance with Ethical Standards

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yan, J., Xie, Y., Guo, Y. et al. CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification. Int J Multimed Info Retr 12, 27 (2023). https://doi.org/10.1007/s13735-023-00286-5

