Abstract
We present a novel SEgmentation TRansformer variant based on causal intervention that serves as an improved vision encoder for semantic segmentation. Many studies have shown that vision transformers (ViT) achieve competitive results on such downstream tasks, which indicates that they learn feature representations well. In other words, they are good at observing the instances in an image. However, to recognize the objects in a scene, the human visual system not only observes the objects themselves but also draws on prior knowledge to produce higher-confidence results. Inspired by this, we introduce a structural causal model (SCM) to model images, category labels, and context. Beyond observing, we propose a causal intervention method that removes the confounding bias of global context, and we plug it into the ViT encoder. Unlike other sequence-to-sequence prediction tasks, we use causal intervention instead of likelihood. In addition, the proxy training objective of the framework is to predict the contextual objects of a region. Finally, we combine this encoder with the segmentation decoder. Experiments show that our proposed method is flexible and effective.
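The abstract does not state the intervention formula. As a hedged illustration only, the sketch below assumes the standard backdoor adjustment, P(Y | do(X)) = Σ_c P(Y | X, c) P(c), with the global context C acting as the confounder between a region feature X and its label Y. The probability tables and function names are hypothetical and are not taken from the paper. The sketch contrasts the ordinary likelihood P(Y | X), which weights each context by P(c | X), with the interventional quantity, which weights each context by its prior P(c).

```python
# Toy illustration of backdoor adjustment; all numbers are made up.
# C: global context (confounder), X: region feature, Y: category label.

P_C = {0: 0.5, 1: 0.5}            # P(C = c)
P_X1_GIVEN_C = {0: 0.2, 1: 0.8}   # P(X = 1 | C = c)
P_Y1_GIVEN_XC = {                 # P(Y = 1 | X = x, C = c)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.9,
}


def p_x_given_c(x, c):
    """P(X = x | C = c) for binary x."""
    p1 = P_X1_GIVEN_C[c]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y = 1 | X = x): each context is weighted by P(C = c | X = x)."""
    p_x = sum(p_x_given_c(x, c) * P_C[c] for c in P_C)
    return sum(
        P_Y1_GIVEN_XC[(x, c)] * p_x_given_c(x, c) * P_C[c] / p_x
        for c in P_C
    )


def interventional(x):
    """P(Y = 1 | do(X = x)): each context is weighted by its prior P(C = c)."""
    return sum(P_Y1_GIVEN_XC[(x, c)] * P_C[c] for c in P_C)


if __name__ == "__main__":
    # The two values differ because C influences both X and Y.
    print(round(observational(1), 3), round(interventional(1), 3))
```

In this toy setting the likelihood overstates the effect of X on Y (0.78 versus 0.6) because the context that makes X likely also makes Y likely; the adjustment removes that contribution.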
This work is supported by the National Natural Science Foundation of China (Nos. 62276073, 61966004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Guangxi “Bagui Scholar” Teams for Innovation and Research Project, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, W., Li, Z. (2023). Causal-SETR: A SEgmentation TRansformer Variant Based on Causal Intervention. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_25
DOI: https://doi.org/10.1007/978-3-031-26293-7_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26292-0
Online ISBN: 978-3-031-26293-7
eBook Packages: Computer Science (R0)