Efficient Transfer Learning for Visual Tasks via Continuous Optimization of Prompts

Conder, Jonathan; Jefferson, Josephine; Pages, Nathan; Jawed, Khurram; Nejati, Alireza; Sagar, Mark

doi:10.1007/978-3-031-06427-2_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13231)

Included in the following conference series:

International Conference on Image Analysis and Processing

Abstract

Traditional methods for adapting pre-trained vision models to downstream tasks involve fine-tuning some or all of the model’s parameters. This approach involves a trade-off. When too many parameters are fine-tuned, the model may lose the benefits of pre-training, such as the ability to generalize to out-of-distribution data. Conversely, when too few parameters are fine-tuned, the model may be unable to adapt effectively to the downstream task. In this paper, we propose Visual Prompt Tuning (VPT) as an alternative to fine-tuning for Transformer-based vision models. Our method is closely related to, and inspired by, prefix-tuning of language models [22]. We find that, by adding parameters to a pre-trained model, VPT offers similar performance to fine-tuning the final layer. In addition, in low-data settings and on specialized tasks such as traffic sign recognition, satellite photo recognition and handwriting classification, VPT improves the performance of Transformer-based vision models.
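
The general idea of tuning continuous prompts can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the abstract does not specify where the prompt vectors are inserted, how they are initialised, or how the classifier is formed. The sketch assumes that a few trainable vectors are appended to the token sequence of a frozen Transformer encoder and that only the output at the class position is used. The small `nn.TransformerEncoder` backbone, the `PromptedEncoder` class, the linear `head`, and all sizes are hypothetical stand-ins chosen for illustration.

```python
import torch
import torch.nn as nn


class PromptedEncoder(nn.Module):
    """Frozen Transformer encoder plus a small set of trainable prompt vectors."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        # Freeze every backbone weight; only the prompts will receive gradients.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Assumption: prompts start as small Gaussian noise.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, embed_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, embed_dim), class embedding at index 0.
        batch = tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Appending keeps the class embedding at index 0.
        out = self.encoder(torch.cat([tokens, prompts], dim=1))
        # The output sequence is longer, but only the class position is returned.
        return out[:, 0]


torch.manual_seed(0)
embed_dim = 32

# Stand-in backbone (randomly initialised here; a real one would be pre-trained).
layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=4, dim_feedforward=64, dropout=0.0, batch_first=True
)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = PromptedEncoder(backbone, embed_dim, num_prompts=8)
head = nn.Linear(embed_dim, 10)  # hypothetical 10-class task head

tokens = torch.randn(4, 1 + 49, embed_dim)  # stand-in for embedded image patches
labels = torch.randint(0, 10, (4,))

# Optimise only the prompts and the head; the backbone stays untouched.
trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)

loss = nn.functional.cross_entropy(head(model(tokens)), labels)
loss.backward()
optimizer.step()
```

In this sketch the prompts add only `num_prompts * embed_dim` trainable values to the backbone, which is the sense in which such a method adds parameters rather than fine-tuning existing ones.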

Notes

1. Specifically ViT-B/32, which is available at https://github.com/openai/CLIP.
2. It does change the number of outputs, but CLIP only uses the one corresponding to the “class” embedding.

References

Berg, T., et al.: Birdsnap: large-scale fine-grained visual categorization of birds. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2019–2026 (2014). https://doi.org/10.1109/CVPR.2014.259
Bossard, L., et al.: Food-101 - mining discriminative components with random forests. In: European Conference on Computer Vision, pp. 446–461 (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Carion, N., et al.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Cheng, G., et al.: Remote sensing image scene classification: benchmark and state of the art. In: Proceedings of the IEEE, vol. 105, pp. 1865–1883 (2017). https://doi.org/10.1109/JPROC.2017.2675998
Cimpoi, M., et al.: Describing textures in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613 (2014). https://doi.org/10.1109/CVPR.2014.461
Coates, A., et al.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 215–223 (2011)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Ehteshami Bejnordi, B., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210 (2017). https://doi.org/10.1001/jama.2017.14585
Goodfellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. Neural Netw. 64, 59–63 (2015). https://doi.org/10.1016/j.neunet.2014.09.005
Helber, P., et al.: Introducing EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 204–207 (2018). https://doi.org/10.1109/IGARSS.2018.8519248
Helber, P., et al.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Selected Top. Appl. Earth Observ. Remote Sens. 12, 2217–2226 (2019). https://doi.org/10.1109/JSTARS.2019.2918242
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97 (2019)
Jayakumar, S.M., et al.: Multiplicative interactions and where to find them. In: International Conference on Learning Representations (2019)
Johnson, J., et al.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1988–1997 (2017). https://doi.org/10.1109/CVPR.2017.215
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Kouw, W.M., Loog, M.: An introduction to domain adaptation and transfer learning. Delft University of Technology, Technical report (2018)
Krause, J., et al.: 3D object representations for fine-grained categorization. In: IEEE International Conference on Computer Vision Workshops, pp. 554–561 (2013). https://doi.org/10.1109/ICCVW.2013.77
Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto, Technical report (2009)
Lake, B., et al.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society 33 (2011)
LeCun, Y., et al.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324 (1998). https://doi.org/10.1109/5.726791
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597 (2021). https://doi.org/10.18653/v1/2021.acl-long.353
Fei-Fei, L., et al.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178 (2004). https://doi.org/10.1109/CVPR.2004.383
Maji, S., et al.: Fine-grained visual classification of aircraft. arXiv preprint (2013)
Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729 (2008). https://doi.org/10.1109/ICVGIP.2008.47
Parkhi, O.M., et al.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505 (2012). https://doi.org/10.1109/CVPR.2012.6248092
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035 (2019)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449 (2017). https://doi.org/10.1162/neco_a_00990
Rebuffi, S.A., et al.: Learning multiple visual domains with residual adapters. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the few-shot paradigm. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (2021). https://doi.org/10.1145/3411763.3451760
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Shin, T., et al.: AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 4222–4235 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.346
Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
Soomro, K., et al.: UCF101: a dataset of 101 human actions classes from videos in the wild. University of Central Florida, Technical report (2012)
Stallkamp, J., et al.: Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw. 32, 323–332 (2012). https://doi.org/10.1016/j.neunet.2012.02.016
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Veeling, B.S., et al.: Rotation equivariant CNNs for digital pathology. In: Medical Image Computing and Computer Assisted Intervention, pp. 210–218 (2018). https://doi.org/10.1007/978-3-030-00934-2_24
Xiao, J., et al.: SUN database: large-scale scene recognition from abbey to zoo. In: IEEE Conference on Computer Vision and Pattern Recognition (2010). https://doi.org/10.1109/CVPR.2010.5539970
Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A.: SUN database: exploring a large collection of scene categories. Int. J. Comput. Vis. 119(1), 3–22 (2014). https://doi.org/10.1007/s11263-014-0748-y
Zamir, A.R., et al.: Taskonomy: disentangling task transfer learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722 (2018). https://doi.org/10.1109/CVPR.2018.00391

Author information

Authors and Affiliations

Soul Machines, Auckland, New Zealand
Jonathan Conder, Josephine Jefferson, Nathan Pages, Khurram Jawed, Alireza Nejati & Mark Sagar

Corresponding author

Correspondence to Jonathan Conder.

Editor information

Editors and Affiliations

Boston University, Boston, MA, USA
Stan Sclaroff
National Research Council, Lecce, Italy
Cosimo Distante
National Research Council, Lecce, Italy
Marco Leo
University of Catania, Catania, Italy
Giovanni M. Farinella
Technische Universität München, Garching, Germany
Federico Tombari

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 119 KB)

About this paper

Cite this paper

Conder, J., Jefferson, J., Pages, N., Jawed, K., Nejati, A., Sagar, M. (2022). Efficient Transfer Learning for Visual Tasks via Continuous Optimization of Prompts. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13231. Springer, Cham. https://doi.org/10.1007/978-3-031-06427-2_25

DOI: https://doi.org/10.1007/978-3-031-06427-2_25
Published: 15 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06426-5
Online ISBN: 978-3-031-06427-2
eBook Packages: Computer Science, Computer Science (R0)
