Adversarial domain adaptation with CLIP for few-shot image classification


Abstract

Few-shot learning focuses on training effective models from limited amounts of training data. Its mainstream approaches have evolved from single-modal to multi-modal methods. The Contrastive Language-Image Pre-training model, known as CLIP, performs image classification by aligning the embedding spaces of images and text. To better transfer knowledge between the image domain and the text domain, we propose a fine-tuning framework for vision-language models built on CLIP. It introduces a novel adversarial domain adaptation approach that trains a classifier, symmetric across text and image, to identify the differences between the two domains. To align text and image features more effectively in a shared space, we adopt two types of confusion losses to construct the aligned semantic space by fine-tuning the multi-modal feature extractor. Experiments on 11 public datasets show that our proposed method outperforms state-of-the-art CLIP-driven learning methods.
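The abstract is the only technical description available on this page. As a rough sketch of the mechanism it outlines, the PyTorch snippet below shows one way an adversarial domain discriminator and a confusion loss could be wired around frozen CLIP encoders. This is a minimal illustration under stated assumptions, not the authors' implementation: the `clip` package is OpenAI's open-source release, and the adapter, discriminator, and uniform-distribution confusion loss are hypothetical stand-ins for the paper's components.

```python
# Minimal sketch of adversarial image/text alignment around frozen CLIP
# encoders. Illustrative only: the adapter, discriminator, and confusion
# loss are stand-ins for the paper's components, not its released code.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
feat_dim = 512  # joint embedding size for ViT-B/32

# Domain classifier: predicts whether a feature came from the image
# encoder (label 0) or the text encoder (label 1).
discriminator = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2)
).to(device)

# Lightweight adapter fine-tuned on top of the frozen encoders
# (a hypothetical stand-in for the fine-tuned feature extractor).
adapter = nn.Linear(feat_dim, feat_dim).to(device)

def confusion_loss(logits):
    """Pull the discriminator's prediction toward a uniform distribution
    so that image and text features become indistinguishable."""
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 0.5)  # two domains
    return F.kl_div(log_probs, uniform, reduction="batchmean")

def adversarial_step(images, tokenized_texts, disc_opt, adapt_opt):
    with torch.no_grad():  # the CLIP backbone stays frozen
        img_feat = model.encode_image(images).float()
        txt_feat = model.encode_text(tokenized_texts).float()
    img_feat, txt_feat = adapter(img_feat), adapter(txt_feat)

    feats = torch.cat([img_feat, txt_feat])
    domains = torch.cat([
        torch.zeros(len(img_feat), dtype=torch.long, device=device),
        torch.ones(len(txt_feat), dtype=torch.long, device=device),
    ])

    # Step 1: train the discriminator to tell the two domains apart.
    disc_loss = F.cross_entropy(discriminator(feats.detach()), domains)
    disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()

    # Step 2: train the adapter to confuse the discriminator, pushing
    # image and text embeddings into one shared semantic space.
    conf_loss = confusion_loss(discriminator(feats))
    adapt_opt.zero_grad(); conf_loss.backward(); adapt_opt.step()
    return disc_loss.item(), conf_loss.item()
```

In practice the two updates are often fused with a gradient reversal layer; the explicit two-step form above simply makes the adversarial roles of the discriminator and the feature extractor visible.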



Data Availability

Our research and data use strictly adhere to ethical guidelines, and informed consent was obtained from all participants. The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62306320), the Natural Science Foundation of Jiangsu Province (No. BK20231063), the Fundamental Research Funds for the Central Universities (No. 2019XKOYMS87), and the Science and Technology Planning Project of Xuzhou (No. KC21193).

Author information


Contributions

Tongfeng Sun, Hongjian Yang, and Zhongnian Li conceived and designed the study. Tongfeng Sun, Hongjian Yang, Zhongnian Li, Xinzheng Xu, and Xiurui Wang performed material preparation, data collection, and analysis. The initial draft of the manuscript was written by Tongfeng Sun, and all authors provided comments on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhongnian Li.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sun, T., Yang, H., Li, Z. et al. Adversarial domain adaptation with CLIP for few-shot image classification. Appl Intell 55, 59 (2025). https://doi.org/10.1007/s10489-024-06088-4
