Abstract
Vision Transformer (ViT) has shown strong potential in various vision tasks by exploiting the Transformer's self-attention mechanism and global perception capability. However, training its large number of parameters requires vast amounts of data and computation, so ViT performs poorly on small and medium-sized datasets. In contrast, convolutional networks maintain high accuracy even with limited data because they incorporate inductive bias (IB). Moreover, causal relationships can uncover the underlying structure of data, making deep learning networks more intelligent. In this work, we propose a Causal Relationship Vision Transformer (CRViT), which refines ViT by fusing causal relationships and IB. We propose a random Fourier features module that makes feature vectors independent of one another, and we use convolution to learn the correct correlations between feature vectors and extract causal features, thereby introducing causal relationships into the network. A convolutional downsampling structure significantly reduces the number of model parameters while introducing IB. Experimental validations underscore the data efficiency of CRViT, which achieves a Top-1 accuracy of 80.6% on the ImageNet-1k dataset, surpassing the ViT benchmark by 2.7% while reducing parameters by 92%. This improved performance is consistent across smaller datasets, including T-ImageNet, CIFAR, and SVHN. We also construct the counterfactual dataset Colorful MNIST and experimentally demonstrate that causal relationships are indeed incorporated.
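For readers unfamiliar with random Fourier features, the sketch below illustrates the standard mapping of Rahimi and Recht (2007) that the module described above builds on. It is a minimal, generic illustration under assumed settings (the feature dimension, bandwidth sigma, and toy token batch are hypothetical), not the exact CRViT module.

```python
import numpy as np

def random_fourier_features(x, n_features=256, sigma=1.0, seed=0):
    """Generic random Fourier feature mapping (Rahimi & Recht, 2007).

    Illustrative only: dimensions and sigma are assumptions, not the
    CRViT configuration.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    # Frequencies drawn from a Gaussian approximate an RBF kernel.
    w = rng.normal(0.0, 1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

# Toy usage on a batch of token features (8 tokens, 64 dimensions).
tokens = np.random.randn(8, 64)
phi = random_fourier_features(tokens)  # shape (8, 256) nonlinear random projection
```

Mapping features through such a nonlinear random projection is one common way to reduce dependence between feature dimensions before learning correlations on top of them.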
Data availability and access
Data is openly available in a public repository. The data that support the findings of this study are openly available in ImageNet-1k at https://www.image-net.org/challenges/LSVRC/2012/, Tiny ImageNet at http://cs231n.stanford.edu/tiny-imagenet-200.zip, CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, CIFAR-100 at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz, SVHN at http://ufldl.stanford.edu/housenumbers/.
Funding
This work is supported by the National Key R&D Program of China [2022ZD0119501]; the National Natural Science Foundation of China [62302278, 52374221]; the Sci. & Tech. Development Fund of Shandong Province of China [ZR2022MF288, ZR2023MF097]; the Taishan Scholar Program of Shandong Province [ts20190936]; the Young Talent of Lifting Engineering for Science and Technology in Shandong, China [SDAST2024QTA055]; the Natural Science Foundation of Shandong Province [ZR2023QF014]; and the Natural Science Foundation of Qingdao Municipality [23-2-1-112-zyyd-jch].
Author information
Authors and Affiliations
Contributions
Faming Lu: Data curation, Supervision, Writing - review & editing. Kunhao Jia: Formal analysis, Methodology, Software, Writing - original draft. Xue Zhang: Project administration, Resources, Supervision, Writing - review & editing. Lin Sun: Supervision, Writing - review & editing.
Corresponding author
Ethics declarations
Ethical and informed consent for data used
This study is based on publicly available data for which ethical approval is not required. All data are de-identified and collected in a manner consistent with ethical standards for research.
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, F., Jia, K., Zhang, X. et al. CRViT: Vision transformer advanced by causality and inductive bias for image recognition. Appl Intell 55, 68 (2025). https://doi.org/10.1007/s10489-024-05910-3