Abstract
In computer vision, Transformers have shown great promise, yet they struggle when trained from scratch on small datasets, often underperforming convolutional neural networks (ConvNets). Our work shows that Vision Transformers (ViTs) suffer from unfocused, dispersed attention when trained on limited data. This insight motivates our Swelling ViT framework, an adaptive training strategy that initializes the ViT with a local attention window and gradually expands it during training. This approach lets the model learn local features more easily, mitigating the attention dispersion phenomenon. On CIFAR-100, Swelling ViT-B trained from scratch achieves 82.60% accuracy after 300 epochs and 83.31% after 900 epochs, a state-of-the-art result that underscores Swelling ViT's ability to counter attention dispersion on small datasets. Moreover, Swelling ViT performs consistently on the large-scale ImageNet dataset, confirming that the strategy does not lose effectiveness in larger data regimes. This work thus narrows the data-efficiency gap for ViT models and introduces a versatile solution that can be readily adapted to various domains, regardless of data availability.
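The mechanism described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the helper names (local_window_mask, LocalWindowAttention) and the linear expansion schedule are assumptions for exposition only; the paper's actual architecture and schedule may differ. The sketch restricts self-attention to a local window over the patch grid and widens that window as training progresses.

```python
# Minimal sketch (assumptions, not the paper's code) of a local attention
# window over the patch grid that "swells" toward global attention.
import math
import torch
import torch.nn as nn


def local_window_mask(grid_size: int, radius: int) -> torch.Tensor:
    """Boolean (N, N) mask: True where two patches may attend to each other,
    i.e. their row and column offsets on the grid are both at most `radius`."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    ), dim=-1).reshape(-1, 2)                              # (N, 2) patch coords
    diff = (coords[:, None, :] - coords[None, :, :]).abs()  # (N, N, 2)
    return (diff <= radius).all(dim=-1)                     # (N, N) bool


class LocalWindowAttention(nn.Module):
    """Multi-head self-attention whose receptive field is set at run time."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # (B, heads, N, d) each
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.masked_fill(~mask, float("-inf"))        # block out-of-window pairs
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: grow the window radius from a small neighborhood to the full grid
# over training (the linear schedule below is an assumed example).
grid_size, dim = 14, 384                                     # 14x14 patches of a 224px image
attn = LocalWindowAttention(dim)
tokens = torch.randn(2, grid_size * grid_size, dim)
for epoch, total_epochs in [(0, 300), (150, 300), (299, 300)]:
    radius = 1 + int((grid_size - 1) * epoch / (total_epochs - 1))
    mask = local_window_mask(grid_size, radius)
    out = attn(tokens, mask)                                 # (2, 196, 384)
```

Starting from a small radius biases early training toward local features, while the widening schedule recovers the ViT's global receptive field by the end of training.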
Chuanrui Hu and Bin Chen contributed equally. Corresponding author: Teng Li. This work is supported by the project of Excellent Research and Innovation Teams in Anhui Province's Universities (No. 2024AH010030).
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hu, C., Chen, B., Feng, X., Nian, F., Wang, J., Li, T. (2025). Swelling-ViT: Rethink Data-Efficient Vision Transformer from Locality. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15034. Springer, Singapore. https://doi.org/10.1007/978-981-97-8505-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8504-9
Online ISBN: 978-981-97-8505-6