
CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Few-shot image classification aims to generalize to unseen categories from only a few training samples. Transfer learning is a prominent approach to this task: a backbone is first learned on the base classes, and a classifier is then trained on the new classes using the previously learned knowledge. Typically, a convolutional neural network (CNN) is the preferred backbone. However, when samples are limited, the representation ability of CNN features degrades, which in turn degrades few-shot classification performance. Recently, large-scale pre-trained vision-language models such as CLIP have shown non-trivial potential, serving as backbones for zero- and few-shot transfer to a range of downstream tasks via prompting. To fully exploit the few-shot classification ability of vision-language models, we propose CoCoOpter, a novel “pre-training + prompt tuning + fine-tuning” paradigm built on CLIP. CoCoOpter alleviates overfitting and preserves generalization to unseen categories. Specifically, it learns an input-conditional neural network that shifts attention away from any specific category toward each individual input sample, which relieves overfitting. To connect the visual and textual signals, the previously learned visual representations are injected into the middle of the pre-trained CLIP to perform automatic prompt tuning, yielding input-specific prompt vectors. Moreover, two learnable lightweight neural networks are attached to the end of CLIP to guide information propagation between classes by fine-tuning both the visual and textual features. We conduct extensive experiments on 11 image classification datasets. The results show that CoCoOpter generalizes better to unseen classes and achieves superior few-shot classification performance with a straightforward design.
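As a rough illustration of this design, the PyTorch-style sketch below combines an input-conditional meta-network that produces a per-image bias for the learnable prompt context with two lightweight residual adapters applied to the frozen CLIP image and text features before cosine-similarity classification. This is a minimal sketch under stated assumptions, not the authors' implementation; the names and hyperparameters (MetaNet, Adapter, few_shot_logits, reduction, alpha, logit_scale, 512-dimensional features) are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaNet(nn.Module):
    """Lightweight network mapping an image feature to a bias vector that is
    added to the learnable prompt context tokens (input-conditional prompts)."""

    def __init__(self, vis_dim: int, ctx_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 16, ctx_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vis_dim) -> prompt bias: (batch, ctx_dim)
        return self.net(image_features)


class Adapter(nn.Module):
    """Residual bottleneck adapter fine-tuned on top of frozen CLIP features."""

    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen feature.
        return self.alpha * self.fc(x) + (1.0 - self.alpha) * x


def few_shot_logits(image_features, text_features, visual_adapter, text_adapter,
                    logit_scale: float = 100.0):
    """Cosine-similarity classification over adapted visual and textual features."""
    img = F.normalize(visual_adapter(image_features), dim=-1)  # (batch, dim)
    txt = F.normalize(text_adapter(text_features), dim=-1)     # (classes, dim)
    return logit_scale * img @ txt.t()                         # (batch, classes)


if __name__ == "__main__":
    # Toy shapes: 512-d CLIP features, 5 classes, batch of 4 images.
    img_feat = torch.randn(4, 512)
    txt_feat = torch.randn(5, 512)
    prompt_bias = MetaNet(vis_dim=512, ctx_dim=512)(img_feat)   # (4, 512)
    logits = few_shot_logits(img_feat, txt_feat, Adapter(512), Adapter(512))
    print(prompt_bias.shape, logits.shape)
```

In a full pipeline, the meta-network output would be added to each learnable context token before the CLIP text encoder, in the spirit of the conditional prompt-learning line of work the paper builds on, while the two adapters fine-tune the final visual and textual features.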


Data availability

All data included in this study are available upon request from the corresponding author.


Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 61806218), the National Key Research and Development Program of China (No. 2021YFB3100800), and the Ministry of Science and Technology of China (No. 2020AAA0108800).

Author information


Contributions

JY and YX wrote the main manuscript. YG and YW revised the manuscript. XZ and XL provided meaningful suggestions on the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yingmei Wei.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Compliance with Ethical Standards

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yan, J., Xie, Y., Guo, Y. et al. CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification. Int J Multimed Info Retr 12, 27 (2023). https://doi.org/10.1007/s13735-023-00286-5

