Abstract
We study the use of deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different, object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require neither additional training nor data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve results competitive with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at https://dino-vit-features.github.io/.
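As a concrete illustration of the zero-shot recipe sketched in the abstract, the following is a minimal example that clusters DINO-ViT patch tokens of a single image into coarse "part" regions. It is a sketch under several assumptions rather than the paper's pipeline: it loads the publicly released facebookresearch/dino hub model, uses the token outputs of the last block via get_intermediate_layers (the paper advocates specific facets of intermediate layers and further refinement), and the image path 'image.jpg' and the choice of k=5 clusters are placeholders.

# Minimal sketch: cluster DINO-ViT patch descriptors into coarse "part" regions.
# Assumptions: last-block token outputs as descriptors; 'image.jpg' and k=5 are placeholders.
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import transforms

# Load a self-supervised DINO ViT (ViT-S/8) from the public hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
img = preprocess(Image.open('image.jpg').convert('RGB')).unsqueeze(0)  # (1, 3, 224, 224)

with torch.no_grad():
    # Token outputs of the last block: (1, 1 + 28*28 patches, 384 dims) for ViT-S/8 at 224x224.
    tokens = model.get_intermediate_layers(img, n=1)[0]

patch_descriptors = tokens[0, 1:]  # drop the [CLS] token -> (784, 384) dense descriptors
labels = KMeans(n_clusters=5, n_init=10).fit_predict(patch_descriptors.numpy())
part_map = labels.reshape(28, 28)  # coarse per-patch "part" segmentation
print(part_map)

In this spirit, the cluster map can be upsampled to the image resolution and compared across images of related categories; the full methods additionally restrict clustering to salient patches and operate on descriptors pooled from multiple images.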
Acknowledgments
We thank Miki Rubinstein, Meirav Galun, Kfir Aberman and Niv Haim for their insightful comments and discussion. This project received funding from the Israeli Science Foundation (grant 2303/20) and the Carolito Stiftung. Dr. Bagon is a Robin Chemers Neustein Artificial Intelligence Fellow.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Amir, S., Gandelsman, Y., Bagon, S., Dekel, T. (2023). On the Effectiveness of ViT Features as Local Semantic Descriptors. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol. 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_3
DOI: https://doi.org/10.1007/978-3-031-25069-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25068-2
Online ISBN: 978-3-031-25069-9