CRViT: Vision transformer advanced by causality and inductive bias for image recognition

Abstract

Vision Transformer (ViT) has shown powerful potential in various vision tasks by exploiting the Transformer's self-attention mechanism and global perception capability. However, training its large number of parameters demands huge amounts of data and computational resources, so ViT performs poorly on small and medium-sized datasets. Compared with ViT, convolutional networks maintain high accuracy even with little data because they incorporate inductive bias (IB). In addition, causal relationships can expose the underlying correlations in data structures, making deep learning networks more intelligent. In this work, we propose the Causal Relationship Vision Transformer (CRViT), which refines ViT by fusing causal relationships with IB. We propose a random Fourier features module that makes feature vectors independent of one another and uses convolution to learn the correct correlations between feature vectors and extract causal features, thereby introducing causal relationships into our network. A convolutional downsampling structure significantly reduces the number of model parameters while introducing IB. Experimental validation underscores the data efficiency of CRViT, which achieves a Top-1 accuracy of 80.6% on the ImageNet-1k dataset, surpassing the ViT baseline by 2.7% while reducing parameters by 92%. This improved performance is consistent across smaller datasets, including T-ImageNet, CIFAR, and SVHN. We also construct the counterfactual dataset Colorful MNIST and experimentally demonstrate that causal relationships are indeed incorporated.
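To make the two ingredients named in the abstract concrete, the following is a minimal, self-contained sketch of (i) a random Fourier features mapping in the spirit of Rahimi and Recht and (ii) a stride-2 convolutional downsampling block that injects convolutional inductive bias. It assumes a PyTorch-style implementation; all module names, shapes, and hyperparameters are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: random Fourier features plus convolutional downsampling.
import math
import torch
import torch.nn as nn


class RandomFourierFeatures(nn.Module):
    """Map in_dim features to num_features random Fourier features,
    z(x) = sqrt(2/M) * cos(x W^T + b), with fixed random W and b."""

    def __init__(self, in_dim: int, num_features: int, sigma: float = 1.0):
        super().__init__()
        # Frozen random projection; W and b receive no gradients.
        self.register_buffer("weight", torch.randn(num_features, in_dim) / sigma)
        self.register_buffer("bias", torch.rand(num_features) * 2 * math.pi)
        self.scale = math.sqrt(2.0 / num_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) -> (batch, tokens, num_features)
        return self.scale * torch.cos(x @ self.weight.t() + self.bias)


class ConvDownsample(nn.Module):
    """Stride-2 convolution: halves spatial resolution and reduces token count
    while introducing the locality bias of convolutional networks."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.proj(x)))


if __name__ == "__main__":
    imgs = torch.randn(2, 64, 56, 56)           # toy feature maps
    feats = ConvDownsample(64, 128)(imgs)       # -> (2, 128, 28, 28)
    tokens = feats.flatten(2).transpose(1, 2)   # -> (2, 784, 128)
    rff = RandomFourierFeatures(128, 256)(tokens)
    print(feats.shape, rff.shape)
```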


Data availability and access

The data that support the findings of this study are openly available in public repositories: ImageNet-1k at https://www.image-net.org/challenges/LSVRC/2012/, Tiny ImageNet at http://cs231n.stanford.edu/tiny-imagenet-200.zip, CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, CIFAR-100 at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz, and SVHN at http://ufldl.stanford.edu/housenumbers/.
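For convenience, the smaller benchmark splits listed above can also be fetched programmatically. The short sketch below uses torchvision (an assumption on our part, not part of the paper) and stores the archives under a local data/ directory; ImageNet-1k and Tiny ImageNet are not auto-downloadable and must be obtained manually from the URLs above.

```python
# Hedged convenience sketch: fetch CIFAR-10/100 and SVHN via torchvision mirrors
# of the archives linked in the data-availability statement.
from torchvision import datasets

root = "data"
cifar10 = datasets.CIFAR10(root=root, train=True, download=True)
cifar100 = datasets.CIFAR100(root=root, train=True, download=True)
svhn = datasets.SVHN(root=root, split="train", download=True)
print(len(cifar10), len(cifar100), len(svhn))  # 50000, 50000, 73257
```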


Funding

This work is supported by the National Key R&D Program of China [2022ZD0119501]; the National Natural Science Foundation of China [62302278, 52374221]; the Sci. & Tech. Development Fund of Shandong Province of China [ZR2022MF288, ZR2023MF097]; the Taishan Scholar Program of Shandong Province [ts20190936]; the Young Talent of Lifting Engineering for Science and Technology in Shandong, China [SDAST2024QTA055]; the Natural Science Foundation of Shandong Province [ZR2023QF014]; and the Natural Science Foundation of Qingdao Municipality [23-2-1-112-zyyd-jch].

Author information

Contributions

Faming Lu: Data curation, Supervision, Writing - review & editing. Kunhao Jia: Formal analysis, Methodology, Software, Writing - original draft. Xue Zhang: Project administration, Resources, Supervision, Writing - review & editing. Lin Sun: Supervision, Writing - review & editing.

Corresponding author

Correspondence to Xue Zhang.

Ethics declarations

Ethical and informed consent for data used

This study is based on publicly available data for which ethical approval is not required. All data are de-identified and collected in a manner consistent with ethical standards for research.

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Lu, F., Jia, K., Zhang, X. et al. CRViT: Vision transformer advanced by causality and inductive bias for image recognition. Appl Intell 55, 68 (2025). https://doi.org/10.1007/s10489-024-05910-3


Keywords