Abstract
The multi-stage hierarchical structure is a basic and effective design pattern in convolutional neural networks (CNNs). Recently, Vision Transformers (ViTs) have achieved impressive performance as a new architecture for various vision tasks, yet many of their properties remain to be explored. In this paper, we empirically find that, despite having no explicit multi-stage hierarchical design like CNNs, ViT models automatically organize their layers into stages (or blockgroups) that gradually extract different levels of feature information. Moreover, ViT models group more highly similar Transformer blocks in the last stage, where multi-head self-attention becomes less effective at learning useful concepts and may therefore limit the expected performance gain. To this end, we recast ViT into a new framework, named TransMCGC, by replacing the inefficient Transformer blocks in the last stage with the proposed convolutional MCGC blocks. The MCGC block consists of two parallel sub-modules: a Multi-branch Convolution module that integrates local neighborhood features with multi-scale context information, and a Global Context module that captures global dependencies with negligible additional parameters. In this way, the MCGC block collaboratively integrates convolutional locality and global dependencies to enhance the feature learning ability of the model. Extensive experiments on six standard small-scale benchmark datasets, namely CIFAR10, CIFAR100, Stanford Cars, Oxford 102 Flowers, DTD and Food101, demonstrate the effectiveness of the MCGC block and show that our TransMCGC models outperform the ViT baseline while achieving competitive performance compared to state-of-the-art ViT variants.
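To make the described design concrete, below is a minimal PyTorch sketch of how an MCGC-style block could be organized, assuming 3x3 and 5x5 convolution branches for the multi-branch path and a GCNet-style softmax-pooled descriptor for the global-context path; the branch kernels, the fusion by summation, and the residual connection are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class MCGCBlockSketch(nn.Module):
    """Illustrative sketch of an MCGC-style block: a multi-branch
    convolution path and a lightweight global-context path run in
    parallel, and their outputs are fused by summation. Branch
    kernels, widths, and fusion rule are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Multi-branch convolution: local neighborhood + multi-scale context.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Global context with negligible parameters: a 1x1 conv yields a
        # spatial mask used to pool a single global descriptor.
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Parallel local branches aggregated by summation.
        local = self.branch3(x) + self.branch5(x)
        # Global context: softmax-normalized spatial pooling.
        mask = self.context_mask(x).view(b, 1, h * w).softmax(dim=-1)
        ctx = torch.bmm(x.view(b, c, h * w), mask.transpose(1, 2))  # (b, c, 1)
        ctx = self.transform(ctx.view(b, c, 1, 1))
        # Fuse convolutional locality and global dependencies,
        # with a residual path back to the input.
        return self.act(x + local + ctx)
```

A block of this kind would operate on the spatial feature map of the last stage, e.g. `MCGCBlockSketch(384)` applied to tensors of shape `(batch, 384, 14, 14)`.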
Data availability
The datasets analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61872153 and 61972288).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Appendix 1: Attention visualization
More visualization results. The attention maps of the class token for the pretrained ViT-B/16 model with 12 layers (rows) and 12 heads (columns). The class-token attention maps with similar attention patterns are automatically organized into blockgroups, and in the last stage the maps repeat increasingly similar patterns, forming a larger, highly correlated blockgroup.

Average attention visualization for the pretrained ViT-B/16 model with 12 layers (rows) and 12 heads (columns). Different visualization samples exhibit the same pattern: layers with similar attention patterns automatically organize into blockgroups that constitute multiple stages, gradually capturing different levels of feature information.
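As a hedged sketch of how such blockgroups can be quantified, one can flatten each layer's attention maps and compute pairwise layer-to-layer similarities; contiguous blocks of high values along the diagonal of the resulting matrix correspond to stages. The cosine-similarity metric and the random stand-in attention maps below are illustrative assumptions, not the paper's exact analysis pipeline.

```python
import torch


def layer_attention_similarity(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """Pairwise cosine similarity between per-layer attention maps.

    attn_maps: list of L tensors, one per Transformer layer, each of
    shape (heads, tokens, tokens) for a single image. Blocks of high
    similarity in the returned (L, L) matrix indicate layers that have
    organized into the same stage (blockgroup).
    """
    flat = torch.stack([a.flatten() for a in attn_maps])  # (L, H*T*T)
    flat = torch.nn.functional.normalize(flat, dim=1)     # unit vectors
    return flat @ flat.T                                  # (L, L) cosine sims


# Illustrative usage with random stand-ins for the 12 layers of ViT-B/16
# (12 heads, 197 tokens); real maps would come from forward hooks.
maps = [torch.rand(12, 197, 197).softmax(dim=-1) for _ in range(12)]
sim = layer_attention_similarity(maps)
print(sim.shape)  # torch.Size([12, 12])
```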

About this article
Cite this article
Xiang, JW., Chen, MR., Li, PS. et al. TransMCGC: a recast vision transformer for small-scale image classification tasks. Neural Comput & Applic 35, 7697–7718 (2023). https://doi.org/10.1007/s00521-022-08067-7