Abstract
Almost all Vision Transformer-based models require pre-training on massive datasets at a high computational cost. If researchers do not have enough data to train a Vision Transformer-based model, or do not have GPUs powerful enough to process millions of labeled images, Vision Transformer-based models have no advantage over CNNs. Swin Transformer addresses these problems with shifted window-based self-attention, which has linear computational complexity. Although Swin Transformer significantly reduces computing cost and works well on mid-size datasets, it still performs poorly when trained on a small dataset. In this paper, we propose a hierarchical and data-efficient Transformer based on Swin Transformer, which we call ESwin Transformer. We mainly redesign the patch embedding and patch merging modules of Swin Transformer, adding only simple convolutional components, which significantly improves performance when the model is trained on a small dataset. Our empirical results show that ESwin Transformer, trained on CIFAR10/CIFAR100 with no extra data for 300 epochs, achieves \(97.17\%\)/\(83.78\%\) accuracy and outperforms Swin Transformer and DeiT in the same training time.
Data availability
All data generated or analyzed during this study are included in this published article, and the datasets used or analyzed during the current study are openly available online.
Acknowledgements
We thank many colleagues at our lab for their help.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest or competing interests.
Code availability
The source code used in the current study is available at https://github.com/ygdr2020/eswin_transformer
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
1.1 Detailed architectures
The detailed architecture specifications are shown in Table 8, where an input image size of 32\(\times \)32 is assumed for all architectures. "CE" denotes a conv embedding module, "96-d" indicates that the feature maps leaving that module have an output dimension of 96, "DS" denotes a downsample module, and "win.sz.4\(\times \)4" denotes a multi-head self-attention (MSA) module with a window size of 4\(\times \)4.
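To make this notation concrete, below is a minimal PyTorch sketch of what a conv embedding ("CE") module and a downsample ("DS") module of this kind could look like for a 32\(\times \)32 input. The class names, kernel sizes, strides, and the use of batch normalization and ReLU are illustrative assumptions, not the exact configuration reported in Table 8; the released source code gives the definitive implementation.

import torch
import torch.nn as nn


class ConvEmbedding(nn.Module):
    # Illustrative conv embedding ("CE") module (assumed layout, not the
    # exact Table 8 configuration): a small stack of convolutions that maps
    # a 3 x 32 x 32 image to a 16 x 16 grid of 96-d tokens.
    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):
        x = self.proj(x)                     # B x 96 x 16 x 16
        return x.flatten(2).transpose(1, 2)  # B x 256 x 96 token sequence


class ConvDownsample(nn.Module):
    # Illustrative downsample ("DS") module: a strided convolution that halves
    # the spatial resolution and doubles the channel dimension, standing in
    # for the original patch merging layer of Swin Transformer.
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2d(2 * dim)

    def forward(self, x, H, W):
        B, L, C = x.shape                    # L = H * W tokens of dimension C
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.norm(self.reduction(x))
        return x.flatten(2).transpose(1, 2)  # B x (H/2 * W/2) x 2C

Under these assumptions, a 32\(\times \)32 input yields a 16\(\times \)16 grid of 96-d tokens for the first stage, and each downsample halves the spatial resolution while doubling the channel dimension, mirroring the hierarchical design that ESwin Transformer inherits from Swin Transformer.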
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, D., Shao, Y. A data efficient transformer based on Swin Transformer. Vis Comput 40, 2589–2598 (2024). https://doi.org/10.1007/s00371-023-02939-2