ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15105)

Abstract

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we discover that CLIP’s image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP’s representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, confirming the significance of our findings.
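
Because the abstract names three concrete, localized changes to CLIP's last transformer block, a compact sketch may help make them tangible. The PyTorch-style module below is a minimal illustration under stated assumptions (hypothetical class and attribute names; in practice the pretrained weights of the last block would be reused), not the authors' released code: it drops the residual branch, swaps query-key attention for query-query (self-self) attention, and omits the feed-forward network.

```python
import torch
import torch.nn as nn


class ClearFinalBlock(nn.Module):
    """Sketch of a CLIP ViT final block modified as described above:
    no residual connection, self-self (query-query) attention, and no
    feed-forward network. Names and shapes are illustrative assumptions,
    not the authors' implementation."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.ln = nn.LayerNorm(dim)
        # In practice these projections would be loaded from the pretrained block.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        B, N, C = x.shape
        q, _, v = self.qkv(self.ln(x)).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Self-self attention: affinities between queries and queries,
        # replacing the usual query-key attention.
        attn = (q @ q.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # Return the attention output alone: the residual branch (x) and the
        # feed-forward sub-layer of the original block are discarded.
        return self.proj(out)
```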


Notes

  1. SCLIP [40] could be roughly regarded as a special case of \(\alpha=2\), i.e., \(Proj((Attn_{qq}+Attn_{kk})\cdot v)\approx Proj(2Attn_{qk}\cdot v)\approx 2X_{attn}\) (an illustrative sketch of this variant follows these notes).

  2. The final projection layer is omitted here for brevity.
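
For contrast with the query-query attention sketched after the abstract, the relation in Note 1 can be read as follows: SCLIP-style attention sums the q-q and k-k attention maps, which behaves roughly like twice the vanilla q-k attention output. The helper below is an illustrative sketch of that reading (hypothetical function; shapes assumed as (batch, heads, tokens, head_dim)), not the SCLIP or ClearCLIP implementation.

```python
import torch


def self_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        mode: str = "qq") -> torch.Tensor:
    """Illustrative self-self attention variants referenced in Note 1.

    mode='qq'   : query-query attention (as in the sketch after the abstract).
    mode='qq+kk': SCLIP-style sum of q-q and k-k attention maps, which Note 1
                  reads as roughly 2 * Attn_qk * v, i.e. the alpha = 2 case.
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ q.transpose(-2, -1)) * scale, dim=-1)
    if mode == "qq+kk":
        attn = attn + torch.softmax((k @ k.transpose(-2, -1)) * scale, dim=-1)
    elif mode != "qq":
        raise ValueError(f"unknown mode: {mode}")
    return attn @ v
```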

References

  1. Alayrac, J.B., et al.: Self-supervised multimodal versatile networks. Adv. Neural. Inf. Process. Syst. 33, 25–37 (2020)

  2. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)

  3. Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: emerging localization properties in vision-language transformers. arXiv preprint arXiv:2312.00878 (2023)

  4. Caesar, H., Uijlings, J., Ferrari, V.: COCO-stuff: thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218 (2018)

  5. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

  6. Cha, J., Mun, J., Roh, B.: Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11165–11174 (2023)

  7. Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022)

  8. Cherti, M., et al.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829 (2023)

  9. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning, pp. 1931–1942. PMLR (2021)

  10. Contributors, M.: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020)

  11. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)

  12. Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)

  13. Everingham, M., Winn, J.: The PASCAL visual object classes challenge 2012 (VOC2012) development kit. Pattern Anal. Stat. Model. Comput. Learn., Tech. Rep. 2007(1-45), 5 (2012)

  14. Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting CLIP’s image representation via text-based decomposition. arXiv preprint arXiv:2310.05916 (2023)

  15. Gray, R.M.: Entropy and Information Theory. Springer, Heidelberg (2011). https://doi.org/10.1007/978-1-4419-7970-4

  16. Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414 (2022)

  17. Han, C., Zhong, Y., Li, D., Han, K., Ma, L.: Open-vocabulary semantic segmentation with decoupled one-pass network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1086–1096 (2023)

  18. He, W., Jamonnak, S., Gou, L., Ren, L.: CLIP-S4: language-guided self-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11207–11216 (2023)

  19. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)

  20. Jiao, S., Wei, Y., Wang, Y., Zhao, Y., Shi, H.: Learning mask-aware CLIP representations for zero-shot segmentation. arXiv preprint arXiv:2310.00240 (2023)

  21. Khan, A.U., Kuehne, H., Gan, C., Lobo, N.D.V., Shah, M.: Weakly supervised grounding for VQA in vision-language transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 652–670. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_38

  22. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)

  23. Lan, M., Wang, X., Ke, Y., Xu, J., Feng, L., Zhang, W.: SmooSeg: smoothness prior for unsupervised semantic segmentation. Adv. Neural Inf. Process. Syst. 36 (2024)

  24. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)

  25. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021)

  26. Li, Y., Wang, H., Duan, Y., Li, X.: CLIP surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653 (2023)

  27. Li, Y., Li, Z., Zeng, Q., Hou, Q., Cheng, M.M.: Cascade-CLIP: cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670 (2024)

  28. Li, Z., Zhou, Q., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Open-vocabulary object segmentation with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7667–7676 (2023)

  29. Liang, F., et al.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7061–7070 (2023)

  30. Luo, H., Bao, J., Wu, Y., He, X., Li, T.: SegCLIP: patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: International Conference on Machine Learning, pp. 23033–23044. PMLR (2023)

  31. Melas-Kyriazi, L., Rupprecht, C., Laina, I., Vedaldi, A.: Deep spectral methods: a surprisingly strong baseline for unsupervised semantic segmentation and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8364–8375 (2022)

  32. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889 (2020)

  33. Mishra, A., Alahari, K., Jawahar, C.: Image retrieval using textual cues. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3040–3047 (2013)

  34. Mottaghi, R., et al.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898 (2014)

  35. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  36. Ren, P., et al.: ViewCo: discovering text-supervised segmentation masks via multi-view semantic consistency. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=2XLRBjY46O6

  37. Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural. Inf. Process. Syst. 35, 25278–25294 (2022)

  38. Shin, G., Xie, W., Albanie, S.: ReCo: retrieve and co-segment for zero-shot transfer. Adv. Neural. Inf. Process. Syst. 35, 33754–33767 (2022)

  39. Sun, S., Li, R., Torr, P., Gu, X., Li, S.: CLIP as RNN: segment countless visual concepts without training endeavor. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13171–13182 (2024)

  40. Wang, F., Mei, J., Yuille, A.: SCLIP: rethinking self-attention for dense vision-language inference. arXiv preprint arXiv:2312.01597 (2023)

  41. Wu, S., et al.: CLIPSelf: vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403 (2023)

  42. Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: bridging semantic gaps for language-supervised semantic segmentation. In: Thirty-seventh Conference on Neural Information Processing Systems (2023). https://openreview.net/forum?id=9iafshF7s3

  43. Xu, H., et al.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)

  44. Xu, J., et al.: GroupViT: semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18134–18144 (2022)

  45. Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2955–2966 (2023)

  46. Xu, J., et al.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2935–2944 (2023)

  47. Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2945–2954 (2023)

  48. Xu, M., et al.: A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 736–753. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_42

  49. Xu, X., Xiong, T., Ding, Z., Tu, Z.: MasQCLIP for open-vocabulary universal image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 887–898 (2023)

  50. Yao, L., et al.: FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783 (2021)

  51. Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)

  52. Yu, Q., He, J., Deng, X., Shen, X., Chen, L.C.: Convolutions die hard: open-vocabulary segmentation with single frozen convolutional CLIP. arXiv preprint arXiv:2308.02487 (2023)

  53. Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)

  54. Zhang, F., et al.: Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation. arXiv preprint arXiv:2310.19001 (2023)

  55. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)

  56. Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from CLIP. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13688, pp. 696–712. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_40

Acknowledgments.

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Author information

Corresponding author

Correspondence to Litong Feng.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 8797 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W. (2025). ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72970-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72969-0

  • Online ISBN: 978-3-031-72970-6

  • eBook Packages: Computer Science, Computer Science (R0)
