
A data efficient transformer based on Swin Transformer

  • Original article, published in The Visual Computer

Abstract

Almost all Vision Transformer-based models require pre-training on massive datasets at considerable computational cost. If researchers do not have enough data to train a Vision Transformer-based model, or lack the powerful GPUs needed to process millions of labeled images, Vision Transformer-based models offer no advantage over CNNs. Swin Transformer addresses these problems with shifted window-based self-attention, whose computational complexity is linear in image size. Although Swin Transformer significantly reduces computational cost and works well on mid-size datasets, it still performs poorly when trained on small datasets. In this paper, we propose a hierarchical, data-efficient Transformer based on Swin Transformer, which we call ESwin Transformer. We mainly redesign the patch embedding and patch merging modules of Swin Transformer, adding only simple convolutional components, which significantly improves performance when the model is trained on a small dataset. Our empirical results show that ESwin Transformer, trained on CIFAR10/CIFAR100 for 300 epochs with no extra data, achieves \(97.17\%\)/\(83.78\%\) accuracy and outperforms Swin Transformer and DeiT given the same training time.
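
To make the idea concrete, the sketch below shows one way a convolutional patch embedding can replace the linear patch projection of the original Swin Transformer: a short stack of strided convolutions with normalization and non-linearity produces the initial token grid. This is a minimal PyTorch sketch under assumed settings (3x3 kernels, BatchNorm and ReLU, a 96-dimensional embedding for 32x32 inputs); the class name ConvPatchEmbedding and the exact layer configuration are illustrative and are not taken from the paper's released implementation.

import torch
import torch.nn as nn

class ConvPatchEmbedding(nn.Module):
    # Embed a 32x32 image into an 8x8 grid of 96-dimensional tokens (assumed sizes).
    def __init__(self, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        self.proj = nn.Sequential(
            # Two stride-2 convolutions reduce resolution 4x in total, matching the
            # 4x4 patch size of the original linear patch embedding.
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                  # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C) token sequence
        return self.norm(x)

tokens = ConvPatchEmbedding()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 96])

The resulting token sequence can then be fed to the window-based self-attention blocks exactly as the output of a linear patch embedding would be.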


Data availability

All data generated or analyzed during this study are included in this published article, and the datasets used are publicly available online.


Acknowledgements

We thank many colleagues at our lab for their help.

Author information

Corresponding author

Correspondence to Yunxue Shao.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest or competing interests.

Code availability

The source code used in the current study is available at https://github.com/ygdr2020/eswin_transformer

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

1.1 Detailed architectures

The detailed architecture specifications are shown in Table 8, where an input image size of 32\(\times \)32 is assumed for all architectures. "CE" indicates a conv embedding module, "96-d" denotes that the output dimension of the module is 96, "DS" denotes a downsample module, and "win. sz. 4\(\times \)4" indicates a multi-head self-attention (MSA) module with a window size of 4\(\times \)4.
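
For intuition on how a convolutional "DS" module can differ from Swin Transformer's patch merging, the following minimal PyTorch sketch downsamples the token grid with a single strided convolution that halves the resolution and doubles the channel dimension between stages. The class name ConvDownsample, the 3x3 kernel, and the normalization placement are assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn

class ConvDownsample(nn.Module):
    # Halve the spatial resolution and double the channel dimension between stages.
    def __init__(self, dim: int):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(2 * dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, _, c = x.shape                          # token sequence from the previous stage
        x = x.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map
        x = self.reduction(x)                      # (B, 2C, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)           # flatten to tokens again
        return self.norm(x)

# Stage transition on CIFAR-sized inputs: an 8x8 grid of 96-dim tokens
# becomes a 4x4 grid of 192-dim tokens.
out = ConvDownsample(96)(torch.randn(2, 64, 96), 8, 8)
print(out.shape)  # torch.Size([2, 16, 192])

Stacking such downsample modules between groups of window-based attention blocks yields the hierarchical feature maps summarized in Table 8.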

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yao, D., Shao, Y. A data efficient transformer based on Swin Transformer. Vis Comput 40, 2589–2598 (2024). https://doi.org/10.1007/s00371-023-02939-2

