CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention

Chen, Yaxiong; Wei, Minghong; Zheng, Zixuan; Hu, Jingliang; Shi, Yilei; Xiong, Shengwu; Zhu, Xiao Xiang; Mou, Lichao

doi:10.1007/978-3-031-72384-1_8


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15003)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging because the two modalities have distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method that achieves text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and extracts causal features from the inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
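
As a rough illustration of what text-to-pixel alignment can look like in general, the sketch below scores every pixel feature against a single text embedding by cosine similarity and thresholds the result into a mask. It is not the decoder described in the paper: the function names, the `temperature` and `threshold` parameters, and the toy features are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def text_to_pixel_mask(pixel_feats, text_emb, temperature=0.1, threshold=0.5):
    """Score every pixel feature against one text embedding.

    pixel_feats: H x W grid (nested lists) of D-dimensional feature vectors.
    text_emb:    D-dimensional sentence embedding.
    Returns (probs, mask): per-pixel sigmoid probabilities and a binary mask.
    Hypothetical helper for illustration only; temperature and threshold
    are assumed values, not settings taken from the paper.
    """
    t = l2_normalize(text_emb)
    probs, mask = [], []
    for row in pixel_feats:
        prob_row = []
        for feat in row:
            f = l2_normalize(feat)
            cos = sum(a * b for a, b in zip(f, t))  # cosine similarity in [-1, 1]
            prob_row.append(1.0 / (1.0 + math.exp(-cos / temperature)))
        probs.append(prob_row)
        mask.append([1 if p > threshold else 0 for p in prob_row])
    return probs, mask

# Toy 2 x 2 "image" with 2-D features; the text embedding points along the first axis.
feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[-1.0, 0.0], [2.0, 0.1]]]
probs, mask = text_to_pixel_mask(feats, [1.0, 0.0])
```

In this toy case the pixels whose features point along the text embedding are selected, while orthogonal or opposite features are rejected. In a real system the pixel features and text embedding would come from learned image and text encoders rather than hand-written lists.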

Work done during an internship at MedAI Technology (Wuxi) Co. Ltd.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China (2022ZD0160604), in part by the Natural Science Foundation of China (62101393/62176194), in part by the High-Performance Computing Platform of YZBSTCACC, and in part by MindSpore (https://www.mindspore.cn), a new deep learning framework.

Author information

Authors and Affiliations

Wuhan University of Technology, Wuhan, China
Yaxiong Chen, Minghong Wei & Shengwu Xiong
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yaxiong Chen & Shengwu Xiong
MedAI Technology (Wuxi) Co. Ltd., Wuxi, China
Zixuan Zheng, Jingliang Hu, Yilei Shi & Lichao Mou
Technical University of Munich, Munich, Germany
Xiao Xiang Zhu

Corresponding author

Correspondence to Lichao Mou.

Editor information

Editors and Affiliations

Children’s National Hospital/George Washington University, Washington, DC, USA
Marius George Linguraru
The Chinese University of Hong Kong, Hong Kong, China
Qi Dou
Technical University of Denmark, Kgs Lyngby, Denmark
Aasa Feragen
Imperial College London, London, UK
Stamatia Giannarou
Imperial College London, London, UK
Ben Glocker
Universitat de Barcelona, Barcelona, Spain
Karim Lekadir
Helmholtz Munich, Technical University of Munich and King’s College London, Munich, Germany
Julia A. Schnabel

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this paper.

About this paper

Cite this paper

Chen, Y. et al. (2024). CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15003. Springer, Cham. https://doi.org/10.1007/978-3-031-72384-1_8

DOI: https://doi.org/10.1007/978-3-031-72384-1_8
Published: 03 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72383-4
Online ISBN: 978-3-031-72384-1
eBook Packages: Computer Science, Computer Science (R0)
