Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers

Hu, Hao; Baldassarre, Federico; Azizpour, Hossein

doi:10.1007/978-3-031-26409-2_25


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13715)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases


Abstract

Vision transformers have recently shown remarkable performance in various visual recognition tasks, particularly in self-supervised representation learning. Their key advantage for self-supervised learning over convolutional counterparts is a weaker inductive bias, which makes transformers amenable to learning rich representations from massive amounts of unlabeled data. This same flexibility, however, makes self-supervised vision transformers susceptible to overfitting when they are fine-tuned on small labeled target datasets. In this work, we therefore make a simple yet effective architectural change: we introduce new learnable masked tokens into vision transformers, which reduces overfitting in transfer learning while retaining the desirable flexibility of vision transformers. In experiments with two seminal self-supervised vision transformers, SiT and DINO, and several small target visual recognition tasks, we show consistent and significant improvements in the accuracy of the fine-tuned models across all target tasks.

This work is partially supported by KTH Digital Futures and Wallenberg AI, Autonomous Systems and Software Program (WASP).


Notes

1. We omit LN and MLP layers in between for convenience.
2. We remove the layer index l, and replace it with the image index k for convenience.


Acknowledgements

The project was partially funded by KTH Digital Futures and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

Author information

Authors and Affiliations

KTH Royal Institute of Technology, Stockholm, Sweden
Hao Hu, Federico Baldassarre & Hossein Azizpour

Corresponding author

Correspondence to Hao Hu.

Editor information

Editors and Affiliations

Grenoble Alpes University, Saint Martin d'Hères, France
Massih-Reza Amini
INSA Rouen Normandy, Saint Etienne du Rouvray, France
Stéphane Canu
Ruhr-Universität Bochum, Bochum, Germany
Asja Fischer
KU Leuven, Leuven, Belgium
Tias Guns
Central European University, Vienna, Austria
Petra Kralj Novak
Aristotle University of Thessaloniki, Thessaloniki, Greece
Grigorios Tsoumakas


Cite this paper

Hu, H., Baldassarre, F., Azizpour, H. (2023). Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol 13715. Springer, Cham. https://doi.org/10.1007/978-3-031-26409-2_25


DOI: https://doi.org/10.1007/978-3-031-26409-2_25
Published: 17 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26408-5
Online ISBN: 978-3-031-26409-2
eBook Packages: Computer Science, Computer Science (R0)
