Abstract
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, often producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we discover that CLIP’s image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP’s representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
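To make the three final-layer modifications concrete, below is a minimal PyTorch-style sketch, assuming a pre-norm CLIP ViT block with single-head attention for brevity. The function name `clear_clip_final_block` and the weight names `W_q`, `W_k`, `W_v`, `W_proj` are illustrative placeholders rather than the released implementation, and the q-q form shown is only one possible self-self attention choice.

```python
import torch
import torch.nn as nn

def clear_clip_final_block(x, ln_1, W_q, W_k, W_v, W_proj):
    """Illustrative sketch of the three changes to the last CLIP vision layer:
      1. the residual branch x is not added back,
      2. vanilla q-k attention is replaced by a self-self variant (q-q here),
      3. the feed-forward network is skipped entirely.
    All names are hypothetical stand-ins for the corresponding CLIP weights.
    """
    h = ln_1(x)                                    # [N, D] normalized patch tokens (CLS omitted)
    q, k, v = h @ W_q, h @ W_k, h @ W_v            # linear projections; k unused in the q-q variant
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(scale * q @ q.transpose(-2, -1), dim=-1)  # self-self (q-q) attention map
    return (attn @ v) @ W_proj                     # X_attn only: no residual, no FFN


# Minimal usage with random tensors standing in for CLIP ViT-B/16 features.
N, D = 196, 768
x = torch.randn(N, D)
ln_1 = nn.LayerNorm(D)
W_q, W_k, W_v, W_proj = (torch.randn(D, D) * D ** -0.5 for _ in range(4))
dense_features = clear_clip_final_block(x, ln_1, W_q, W_k, W_v, W_proj)  # [196, 768]
```

In an open-vocabulary setting, the returned per-patch features would then be matched against CLIP text embeddings of the candidate class names to produce the segmentation map.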
Notes
1. SCLIP [40] could be roughly regarded as a special case of \(\alpha = 2\), i.e., \(\mathrm{Proj}((\mathrm{Attn}_{qq}+\mathrm{Attn}_{kk})\cdot v)\approx \mathrm{Proj}(2\,\mathrm{Attn}_{qk}\cdot v)\approx 2X_{\mathrm{attn}}\); the attention maps are spelled out after these notes.
2. The final projection layer is omitted here for brevity.
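For completeness, and assuming the standard scaled dot-product form with queries \(q\), keys \(k\), values \(v\), and head dimension \(d\) taken from the final layer, the attention maps referenced in note 1 read as
\[
\mathrm{Attn}_{qk}=\mathrm{softmax}\!\left(\frac{qk^{\top}}{\sqrt{d}}\right),\qquad
\mathrm{Attn}_{qq}=\mathrm{softmax}\!\left(\frac{qq^{\top}}{\sqrt{d}}\right),\qquad
\mathrm{Attn}_{kk}=\mathrm{softmax}\!\left(\frac{kk^{\top}}{\sqrt{d}}\right).
\]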
References
Alayrac, J.B., et al.: Self-supervised multimodal versatile networks. Adv. Neural Inf. Process. Syst. 33, 25–37 (2020)
Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: emerging localization properties in vision-language transformers. arXiv preprint arXiv:2312.00878 (2023)
Caesar, H., Uijlings, J., Ferrari, V.: COCO-stuff: thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218 (2018)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Cha, J., Mun, J., Roh, B.: Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11165–11174 (2023)
Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022)
Cherti, M., et al.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829 (2023)
Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning, pp. 1931–1942. PMLR (2021)
MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)
Everingham, M., Winn, J.: The pascal visual object classes challenge 2012 (VOC2012) development kit. Pattern Anal. Stat. Model. Comput. Learn., Tech. Rep. 2007(1-45), 5 (2012)
Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting CLIP’s image representation via text-based decomposition. arXiv preprint arXiv:2310.05916 (2023)
Gray, R.M.: Entropy and Information Theory. Springer, Heidelberg (2011). https://doi.org/10.1007/978-1-4419-7970-4
Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414 (2022)
Han, C., Zhong, Y., Li, D., Han, K., Ma, L.: Open-vocabulary semantic segmentation with decoupled one-pass network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1086–1096 (2023)
He, W., Jamonnak, S., Gou, L., Ren, L.: CLIP-S4: language-guided self-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11207–11216 (2023)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Jiao, S., Wei, Y., Wang, Y., Zhao, Y., Shi, H.: Learning mask-aware CLIP representations for zero-shot segmentation. arXiv preprint arXiv:2310.00240 (2023)
Khan, A.U., Kuehne, H., Gan, C., Lobo, N.D.V., Shah, M.: Weakly supervised grounding for VQA in vision-language transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 652–670. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_38
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)
Lan, M., Wang, X., Ke, Y., Xu, J., Feng, L., Zhang, W.: SmooSeg: smoothness prior for unsupervised semantic segmentation. Adv. Neural Inf. Process. Syst. 36 (2024)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 34, 9694–9705 (2021)
Li, Y., Wang, H., Duan, Y., Li, X.: CLIP surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653 (2023)
Li, Y., Li, Z., Zeng, Q., Hou, Q., Cheng, M.M.: Cascade-CLIP: cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670 (2024)
Li, Z., Zhou, Q., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Open-vocabulary object segmentation with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7667–7676 (2023)
Liang, F., et al.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7061–7070 (2023)
Luo, H., Bao, J., Wu, Y., He, X., Li, T.: SegCLIP: patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: International Conference on Machine Learning, pp. 23033–23044. PMLR (2023)
Melas-Kyriazi, L., Rupprecht, C., Laina, I., Vedaldi, A.: Deep spectral methods: a surprisingly strong baseline for unsupervised semantic segmentation and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8364–8375 (2022)
Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889 (2020)
Mishra, A., Alahari, K., Jawahar, C.: Image retrieval using textual cues. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3040–3047 (2013)
Mottaghi, R., et al.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898 (2014)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ren, P., et al.: ViewCo: discovering text-supervised segmentation masks via multi-view semantic consistency. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=2XLRBjY46O6
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 35, 25278–25294 (2022)
Shin, G., Xie, W., Albanie, S.: ReCo: retrieve and co-segment for zero-shot transfer. Adv. Neural Inf. Process. Syst. 35, 33754–33767 (2022)
Sun, S., Li, R., Torr, P., Gu, X., Li, S.: CLIP as RNN: segment countless visual concepts without training endeavor. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13171–13182 (2024)
Wang, F., Mei, J., Yuille, A.: SCLIP: rethinking self-attention for dense vision-language inference. arXiv preprint arXiv:2312.01597 (2023)
Wu, S., et al.: CLIPSelf: vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403 (2023)
Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: bridging semantic gaps for language-supervised semantic segmentation. In: Thirty-seventh Conference on Neural Information Processing Systems (2023). https://openreview.net/forum?id=9iafshF7s3
Xu, H., et al.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)
Xu, J., et al.: GroupViT: semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18134–18144 (2022)
Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2955–2966 (2023)
Xu, J., et al.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2935–2944 (2023)
Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2945–2954 (2023)
Xu, M., et al.: A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 736–753. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_42
Xu, X., Xiong, T., Ding, Z., Tu, Z.: MasQCLIP for open-vocabulary universal image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 887–898 (2023)
Yao, L., et al.: FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783 (2021)
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)
Yu, Q., He, J., Deng, X., Shen, X., Chen, L.C.: Convolutions die hard: open-vocabulary segmentation with single frozen convolutional CLIP. arXiv preprint arXiv:2308.02487 (2023)
Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)
Zhang, F., et al.: Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation. arXiv preprint arXiv:2310.19001 (2023)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)
Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from CLIP. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13688, pp. 696–712. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_40
Acknowledgments.
This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W. (2025). ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6