Abstract
Traditional methods for adapting pre-trained vision models to downstream tasks involve fine-tuning some or all of the model’s parameters. This approach involves trade-offs. When too many parameters are fine-tuned, the model may lose the benefits of pre-training, such as the ability to generalize to out-of-distribution data. If too few are fine-tuned, the model may be unable to adapt effectively to the downstream task. In this paper, we propose Visual Prompt Tuning (VPT) as an alternative to fine-tuning for Transformer-based vision models. Our method is closely related to, and inspired by, prefix-tuning of language models [22]. We find that, by adding parameters to a pre-trained model, VPT offers performance similar to fine-tuning the final layer. In addition, in low-data settings and on specialized tasks, such as traffic sign recognition, satellite photo recognition, and handwriting classification, VPT improves the performance of Transformer-based vision models.
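The general idea of prompt tuning for a Transformer can be illustrated with a dependency-free sketch. This is not the authors' implementation: the toy attention layer, the placement of the prompts directly after the class token, and the names `encode`, `NUM_PROMPTS`, and `DIM` are all assumptions made for illustration. The sketch only shows the forward mechanics: extra prompt vectors (which would be the only trainable values) are added to the input sequence of a frozen layer, the number of outputs grows, and only the output at the class-token position is read.

```python
import math
import random


def self_attention(tokens):
    """Toy stand-in for a frozen Transformer layer (identity projections)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * value[d] for w, value in zip(weights, tokens))
                        for d in range(dim)])
    return outputs


def encode(class_token, patch_tokens, prompts=()):
    """Hypothetical helper: run the frozen layer on [class] + prompts + patches.

    Returns the output at the class-token position and the total output count.
    """
    sequence = [class_token] + list(prompts) + list(patch_tokens)
    outputs = self_attention(sequence)
    return outputs[0], len(outputs)


def random_vector(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


random.seed(0)
# Illustrative sizes only: 49 patches matches a 224x224 image cut into 32x32
# patches; the prompt count and embedding width are arbitrary assumptions.
DIM, NUM_PATCHES, NUM_PROMPTS = 8, 49, 4
class_token = random_vector(DIM)
patches = [random_vector(DIM) for _ in range(NUM_PATCHES)]
prompts = [random_vector(DIM) for _ in range(NUM_PROMPTS)]  # would be learned

baseline, n_baseline = encode(class_token, patches)
prompted, n_prompted = encode(class_token, patches, prompts)
print(n_baseline, n_prompted)  # 50 54
```

In a real setting the prompt vectors would be optimized by gradient descent on the downstream loss while every pre-trained weight stays fixed; the sketch omits training entirely.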
Notes
1. Specifically ViT-B/32, which is available at https://github.com/openai/CLIP.
2. It does change the number of outputs, but CLIP only uses the one corresponding to the “class” embedding.
References
Berg, T., et al.: Birdsnap: large-scale fine-grained visual categorization of birds. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2019–2026 (2014). https://doi.org/10.1109/CVPR.2014.259
Bossard, L., et al.: Food-101 - mining discriminative components with random forests. In: European Conference on Computer Vision, pp. 446–461 (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Carion, N., et al.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Cheng, G., et al.: Remote sensing image scene classification: benchmark and state of the art. In: Proceedings of the IEEE, vol. 105, pp. 1865–1883 (2017). https://doi.org/10.1109/JPROC.2017.2675998
Cimpoi, M., et al.: Describing textures in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613 (2014). https://doi.org/10.1109/CVPR.2014.461
Coates, A., et al.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 215–223 (2011)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Ehteshami Bejnordi, B., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210 (2017). https://doi.org/10.1001/jama.2017.14585
Goodfellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. Neural Netw. 64, 59–63 (2015). https://doi.org/10.1016/j.neunet.2014.09.005
Helber, P., et al.: Introducing EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 204–207 (2018). https://doi.org/10.1109/IGARSS.2018.8519248
Helber, P., et al.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Selected Top. Appl. Earth Observ. Remote Sens. 12, 2217–2226 (2019). https://doi.org/10.1109/JSTARS.2019.2918242
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97 (2019)
Jayakumar, S.M., et al.: Multiplicative interactions and where to find them. In: International Conference on Learning Representations (2019)
Johnson, J., et al.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1988–1997 (2017). https://doi.org/10.1109/CVPR.2017.215
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Kouw, W.M., Loog, M.: An introduction to domain adaptation and transfer learning. Delft University of Technology, Technical report (2018)
Krause, J., et al.: 3D object representations for fine-grained categorization. In: IEEE International Conference on Computer Vision Workshops, pp. 554–561 (2013). https://doi.org/10.1109/ICCVW.2013.77
Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto, Technical report (2009)
Lake, B., et al.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society 33 (2011)
LeCun, Y., et al.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324 (1998). https://doi.org/10.1109/5.726791
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597 (2021). https://doi.org/10.18653/v1/2021.acl-long.353
Fei-Fei, L., et al.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178 (2004). https://doi.org/10.1109/CVPR.2004.383
Maji, S., et al.: Fine-grained visual classification of aircraft. arXiv preprint (2013)
Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729 (2008). https://doi.org/10.1109/ICVGIP.2008.47
Parkhi, O.M., et al.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505 (2012). https://doi.org/10.1109/CVPR.2012.6248092
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035 (2019)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449 (2017). https://doi.org/10.1162/neco_a_00990
Rebuffi, S.A., et al.: Learning multiple visual domains with residual adapters. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the few-shot paradigm. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (2021). https://doi.org/10.1145/3411763.3451760
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Shin, T., et al.: AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 4222–4235 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.346
Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
Soomro, K., et al.: UCF101: a dataset of 101 human actions classes from videos in the wild. University of Central Florida, Technical report (2012)
Stallkamp, J., et al.: Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw. 32, 323–332 (2012). https://doi.org/10.1016/j.neunet.2012.02.016
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Veeling, B.S., et al.: Rotation equivariant CNNs for digital pathology. In: Medical Image Computing and Computer Assisted Intervention, pp. 210–218 (2018). https://doi.org/10.1007/978-3-030-00934-2_24
Xiao, J., et al.: SUN database: large-scale scene recognition from abbey to zoo. In: IEEE Conference on Computer Vision and Pattern Recognition (2010). https://doi.org/10.1109/CVPR.2010.5539970
Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A.: SUN database: exploring a large collection of scene categories. Int. J. Comput. Vis. 119(1), 3–22 (2014). https://doi.org/10.1007/s11263-014-0748-y
Zamir, A.R., et al.: Taskonomy: disentangling task transfer learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722 (2018). https://doi.org/10.1109/CVPR.2018.00391
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Conder, J., Jefferson, J., Pages, N., Jawed, K., Nejati, A., Sagar, M. (2022). Efficient Transfer Learning for Visual Tasks via Continuous Optimization of Prompts. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13231. Springer, Cham. https://doi.org/10.1007/978-3-031-06427-2_25
DOI: https://doi.org/10.1007/978-3-031-06427-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06426-5
Online ISBN: 978-3-031-06427-2
eBook Packages: Computer Science, Computer Science (R0)