
TransMCGC: a recast vision transformer for small-scale image classification tasks

  • Original Article
  • Published in Neural Computing and Applications

Abstract

The multi-stage hierarchical structure is a basic and effective design pattern in convolutional neural networks (CNNs). Recently, Vision Transformers (ViTs) have achieved impressive performance as a new architecture for various vision tasks. However, many properties of ViTs remain to be explored. In this paper, we empirically find that, despite having no explicit multi-stage hierarchical design like CNNs, ViT models automatically organize their layers into stages (or blockgroups) that gradually extract different levels of feature information. Moreover, ViT models concentrate highly similar Transformer blocks in the last stage, where multi-head self-attention becomes less effective at learning useful features and may therefore limit the expected performance gain. To this end, we recast ViT into a new framework, named TransMCGC, by replacing the inefficient Transformer blocks in the last stage with the proposed convolutional MCGC blocks. The MCGC block consists of two parallel sub-modules: a Multi-branch Convolution module, which integrates local neighborhood features and multi-scale context information, and a Global Context module, which captures global dependencies with negligible parameters. In this way, the MCGC block collaboratively integrates convolutional locality and global dependencies to enhance the feature learning ability of the model. Finally, extensive experiments on six standard small-scale benchmark datasets, including CIFAR10, CIFAR100, Stanford Cars, Oxford102flowers, DTD and Food101, demonstrate the effectiveness of the proposed MCGC block and show that our TransMCGC models outperform the ViT baseline while achieving competitive performance compared with state-of-the-art ViT variants.
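To make the block design concrete, here is a minimal PyTorch sketch of an MCGC-style block as the abstract describes it: a Multi-branch Convolution module and a softmax-pooled Global Context module (in the spirit of GCNet-style context modeling) run in parallel and are fused residually. The branch kernel sizes, bottleneck ratio, and fusion rule are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class MCGCBlock(nn.Module):
    """Illustrative MCGC-style block: multi-branch convolution and a
    lightweight global context module in parallel, fused additively.
    Kernel sizes and the fusion rule are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        # Multi-branch Convolution module: parallel kernels with different
        # receptive fields mix local neighborhoods at several scales.
        self.branch1 = nn.Conv2d(dim, dim, kernel_size=1)
        self.branch3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.branch5 = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # Global Context module: a 1x1 conv scores each spatial position,
        # softmax pooling collapses the map to one global descriptor, and a
        # bottleneck transform re-projects it with few extra parameters.
        self.ctx_score = nn.Conv2d(dim, 1, kernel_size=1)
        self.ctx_transform = nn.Sequential(
            nn.Conv2d(dim, dim // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, dim, kernel_size=1),
        )
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local branch: sum of the parallel convolutions.
        local = self.branch1(x) + self.branch3(x) + self.branch5(x)
        # Global branch: softmax-weighted pooling over all H*W positions.
        weights = self.ctx_score(x).flatten(2).softmax(dim=-1)  # (B, 1, HW)
        ctx = torch.bmm(x.flatten(2), weights.transpose(1, 2))  # (B, C, 1)
        ctx = self.ctx_transform(ctx.unsqueeze(-1))             # (B, C, 1, 1)
        # Fuse the two parallel sub-modules with a residual connection.
        return x + self.norm(local + ctx)


# A 14x14 feature map with embedding dimension 384, roughly what a ViT-S
# backbone produces from a 224x224 input.
y = MCGCBlock(384)(torch.randn(2, 384, 14, 14))
```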



Data availability

The datasets analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant Nos. 61872153 and 61972288).

Author information

Corresponding authors

Correspondence to Min-Rong Chen or Pei-Shan Li.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Attention visualization

More visualization results. The attention maps of the class token for the pretrained ViT-B/16 model with 12 layers (rows) and 12 heads (columns). It can be seen that attention maps with similar patterns are automatically organized into blockgroups, and that in the last stage the class-token attention maps repeat highly similar patterns, forming a larger, highly correlated blockgroup.

[Figure a: class-token attention maps for pretrained ViT-B/16, 12 layers × 12 heads]
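As a rough illustration of how such blockgroups can be identified, the sketch below computes pairwise cosine similarity between the head-averaged class-token attention maps of each layer and greedily merges consecutive layers above a similarity threshold. The `(L, H, N, N)` input format and the 0.9 threshold are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def blockgroups_from_attention(attn: torch.Tensor, threshold: float = 0.9):
    """Group consecutive layers into blockgroups by the cosine similarity
    of their head-averaged class-token attention maps.

    attn: (L, H, N, N) attention matrices for L layers and H heads, with
    the class token at index 0 (an assumed input format).
    """
    # Class-token attention over the patch tokens, head-averaged: (L, N-1).
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)
    # Pairwise cosine similarity between layers: (L, L).
    sim = F.cosine_similarity(cls_attn.unsqueeze(1), cls_attn.unsqueeze(0), dim=-1)
    # Greedily merge consecutive layers whose similarity clears the threshold.
    groups, current = [], [0]
    for layer in range(1, attn.shape[0]):
        if sim[layer, layer - 1] >= threshold:
            current.append(layer)
        else:
            groups.append(current)
            current = [layer]
    groups.append(current)
    return sim, groups


# Random tensors stand in for real ViT-B/16 attention (12 layers, 12 heads,
# 196 patch tokens + 1 class token).
sim, groups = blockgroups_from_attention(torch.rand(12, 12, 197, 197))
```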

Average attention visualization for the pretrained ViT-B/16 model with 12 layers (rows) and 12 heads (columns). It can be observed that different visualization samples share the same pattern: layers with similar attention patterns automatically organize into blockgroups that constitute multiple stages, thus gradually capturing different levels of feature information.

[Figure b: average attention maps for pretrained ViT-B/16, 12 layers × 12 heads]
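A grid like the one above can be reproduced in a few lines once the per-layer attention tensors are in hand. The following sketch is a hedged approximation: the averaging rule (mean over query tokens) and the matplotlib layout are assumptions, not the authors' plotting code.

```python
import torch
import matplotlib.pyplot as plt


def plot_average_attention(attn: torch.Tensor, grid: int = 14,
                           out: str = "avg_attention_grid.png"):
    """Render one averaged attention map per (layer, head) cell, mirroring
    the 12x12 grids in this appendix. attn: (L, H, N, N) with the class
    token at index 0; grid*grid must equal the N-1 patch tokens."""
    n_layers, n_heads = attn.shape[:2]
    fig, axes = plt.subplots(n_layers, n_heads, figsize=(n_heads, n_layers))
    for l in range(n_layers):
        for h in range(n_heads):
            # Average each head's attention over all query tokens, keep the
            # attention paid to the patch tokens, and fold it back to 2D.
            avg = attn[l, h].mean(dim=0)[1:].reshape(grid, grid)
            axes[l, h].imshow(avg.numpy())
            axes[l, h].axis("off")
    fig.savefig(out, dpi=150)


plot_average_attention(torch.rand(12, 12, 197, 197))
```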

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xiang, JW., Chen, MR., Li, PS. et al. TransMCGC: a recast vision transformer for small-scale image classification tasks. Neural Comput & Applic 35, 7697–7718 (2023). https://doi.org/10.1007/s00521-022-08067-7

