Abstract
Prompt learning has attracted broad attention in computer vision since the emergence of large pre-trained vision-language models (VLMs). Building on the close alignment between visual and linguistic information that VLMs establish, prompt learning has become a crucial technique in many important applications, such as artificial intelligence generated content (AIGC). In this survey, we provide a progressive and comprehensive review of visual prompt learning as it relates to AIGC. We begin by introducing VLMs, the foundation of visual prompt learning. We then review visual prompt learning methods and prompt-guided generative models, and discuss how to improve the efficiency of adapting AIGC models to specific downstream tasks. Finally, we outline some promising research directions for prompt learning.
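As a concrete illustration of the parameter-efficient adaptation mentioned above, the following is a minimal sketch of visual prompt tuning (VPT): a small set of learnable prompt tokens is prepended to the patch embeddings of a frozen pre-trained vision Transformer, so that only the prompts and a lightweight task head are trained. This is a schematic example under assumed interfaces, not a reference implementation from the surveyed works; the class name and the `backbone` interface (any encoder mapping a token sequence to per-token features) are our own assumptions.

```python
# A minimal sketch of visual prompt tuning (VPT), for illustration only.
# Learnable prompt tokens are prepended to the patch embeddings of a frozen
# pre-trained vision Transformer; only the prompts and the task head train.
import torch
import torch.nn as nn

class VisualPromptTuning(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                 num_prompts: int = 10, num_classes: int = 100):
        super().__init__()
        # Hypothetical backbone: any encoder that maps a token sequence
        # (batch, seq_len, embed_dim) to per-token features of the same shape.
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained model
        # Learnable prompt tokens: the only deep parameters that are tuned.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)  # lightweight task head

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) patch embeddings.
        b = patch_tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)  # prepend prompts
        feats = self.backbone(tokens)          # frozen Transformer encoder
        return self.head(feats.mean(dim=1))    # pool tokens and classify
```

Because the frozen backbone dominates the parameter count, the trainable portion (prompts plus head) is typically only a small fraction of the full model, which is the efficiency argument behind prompt-based adaptation.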
Author information
Contributions
Yiming LEI and Hongming SHAN designed the structure and logic of the paper. Yiming LEI drafted the whole paper. Yuan CAO reviewed the visual prompt learning part. Zilong LI reviewed the prompt-guided generative models part. Jingqi LI reviewed the prompt tuning part. Yiming LEI and Hongming SHAN revised and finalized the paper. All the authors proofread the paper.
Ethics declarations
All the authors declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 62306075 and 62101136), the China Postdoctoral Science Foundation (No. 2022TQ0069), the Natural Science Foundation of Shanghai, China (No. 21ZR1403600), the Shanghai Municipal Science and Technology Project, China (No. 20JC1419500), and the Shanghai Center for Brain Science and Brain-Inspired Technology, China.
About this article
Cite this article
Lei, Y., Li, J., Li, Z. et al. Prompt learning in computer vision: a survey. Front Inform Technol Electron Eng 25, 42–63 (2024). https://doi.org/10.1631/FITEE.2300389
Key words
- Prompt learning
- Visual prompt tuning (VPT)
- Image generation
- Image classification
- Artificial intelligence generated content (AIGC)