Abstract
Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging because the two modalities have distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP was not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method that achieves text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and extracts causal features from the inputs for segmentation decisions. We also devise an adversarial min-max game that optimizes causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of the proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
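The text-to-pixel alignment mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes only that a text embedding and a per-pixel feature map share a channel dimension, scores each pixel by cosine similarity, and thresholds the result. The function names, the `temperature` value, and the toy data are all hypothetical.

```python
import numpy as np

def text_to_pixel_logits(pixel_feats, text_emb, temperature=0.07):
    """Score every pixel against one text embedding (illustrative only).

    pixel_feats: array of shape (C, H, W), one C-dim feature per pixel.
    text_emb:    array of shape (C,), the sentence-level text feature.
    Returns an (H, W) array of temperature-scaled cosine similarities.
    """
    c, h, w = pixel_feats.shape
    flat = pixel_feats.reshape(c, h * w)
    # L2-normalise each pixel feature and the text feature.
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + 1e-8)
    text = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = text @ flat  # cosine similarity per pixel, shape (H*W,)
    return (sim / temperature).reshape(h, w)

def predict_mask(pixel_feats, text_emb, threshold=0.5):
    """Turn the logits into a binary mask with a sigmoid and a threshold."""
    logits = text_to_pixel_logits(pixel_feats, text_emb)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

# Toy 2x2 feature map with 4 channels (hypothetical data).
feats = np.zeros((4, 2, 2))
feats[0, 0, 0] = 1.0   # pixel (0, 0) points along the text direction
feats[1, 0, 1] = 1.0   # pixel (0, 1) is orthogonal to the text
feats[2, 1, 0] = 1.0   # pixel (1, 0) is orthogonal to the text
feats[0, 1, 1] = -1.0  # pixel (1, 1) points away from the text
text = np.array([1.0, 0.0, 0.0, 0.0])

mask = predict_mask(feats, text)
```

In this toy case only the pixel whose feature is aligned with the text embedding is selected. A trained model would learn both feature maps, and would typically be supervised with a pixel-wise loss against the ground-truth mask.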
Work done during an internship at MedAI Technology (Wuxi) Co. Ltd.
Acknowledgments
This work is supported in part by the National Key Research and Development Program of China (2022ZD0160604), in part by the Natural Science Foundation of China (62101393/62176194), in part by the High-Performance Computing Platform of YZBSTCACC, and in part by MindSpore (https://www.mindspore.cn), a new deep learning framework.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this paper.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Y. et al. (2024). CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15003. Springer, Cham. https://doi.org/10.1007/978-3-031-72384-1_8
DOI: https://doi.org/10.1007/978-3-031-72384-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72383-4
Online ISBN: 978-3-031-72384-1
eBook Packages: Computer Science, Computer Science (R0)