Abstract
In computer vision, Transformers have shown great promise, yet they struggle when trained from scratch on small datasets, often underperforming convolutional neural networks (ConvNets). Our work shows that Vision Transformers (ViTs) suffer from unfocused, dispersed attention when trained on limited data. This insight motivates our Swelling ViT framework, an adaptive training strategy that initializes the ViT with a local attention window and gradually expands it during training. This approach lets the model learn local features more easily, mitigating the attention dispersion phenomenon. On CIFAR-100, Swelling ViT-B trained from scratch achieves 82.60% accuracy after 300 epochs and 83.31% after 900 epochs, a state-of-the-art result that underscores Swelling ViT's ability to counter attention dispersion on small datasets. Moreover, Swelling ViT performs consistently on the large-scale ImageNet dataset, confirming that the strategy does not lose effectiveness in larger data regimes. This work thus narrows the data-efficiency gap for ViT models and introduces a versatile solution that can be readily adapted to various domains, regardless of data availability.
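The mechanism described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the helper names (local_window_mask, LocalWindowAttention) and the linear expansion schedule are assumptions for exposition only; the paper's actual architecture and schedule may differ. The sketch restricts self-attention to a local window over the patch grid and widens that window as training progresses.

```python
# Minimal sketch (assumptions, not the paper's code) of a local attention
# window over the patch grid that "swells" toward global attention.
import math
import torch
import torch.nn as nn


def local_window_mask(grid_size: int, radius: int) -> torch.Tensor:
    """Boolean (N, N) mask: True where two patches may attend to each other,
    i.e. their row and column offsets on the grid are both at most `radius`."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    ), dim=-1).reshape(-1, 2)                              # (N, 2) patch coords
    diff = (coords[:, None, :] - coords[None, :, :]).abs()  # (N, N, 2)
    return (diff <= radius).all(dim=-1)                     # (N, N) bool


class LocalWindowAttention(nn.Module):
    """Multi-head self-attention whose receptive field is set at run time."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # (B, heads, N, d) each
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.masked_fill(~mask, float("-inf"))        # block out-of-window pairs
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: grow the window radius from a small neighborhood to the full grid
# over training (the linear schedule below is an assumed example).
grid_size, dim = 14, 384                                     # 14x14 patches of a 224px image
attn = LocalWindowAttention(dim)
tokens = torch.randn(2, grid_size * grid_size, dim)
for epoch, total_epochs in [(0, 300), (150, 300), (299, 300)]:
    radius = 1 + int((grid_size - 1) * epoch / (total_epochs - 1))
    mask = local_window_mask(grid_size, radius)
    out = attn(tokens, mask)                                 # (2, 196, 384)
```

Starting from a small radius biases early training toward local features, while the widening schedule recovers the ViT's global receptive field by the end of training.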
Chuanrui Hu and Bin Chen contributed equally. Corresponding author: Teng Li. This work is supported by the project of Excellent Research and Innovation Teams in Anhui Province's Universities (No. 2024AH010030).
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hu, C., Chen, B., Feng, X., Nian, F., Wang, J., Li, T. (2025). Swelling-ViT: Rethink Data-Efficient Vision Transformer from Locality. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15034. Springer, Singapore. https://doi.org/10.1007/978-981-97-8505-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8504-9
Online ISBN: 978-981-97-8505-6