From One to Many Lorikeets: Discovering Image Analogies in the CLIP Space

  • Conference paper
  • Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15309)

Abstract

Drawing analogies between two pairs of entities in the form A:B::C:D (i.e. A is to B as C is to D) is a hallmark of human intelligence, as evidenced by decades of findings in cognitive science. In recent years, this property has been observed well beyond cognitive science; notable examples are the word2vec and GloVe models in natural language processing. Recent research in computer vision has likewise found analogical structure in the feature space of pretrained ConvNet feature extractors. However, analogy mining in the semantic space of recent strong foundation models such as CLIP remains understudied, despite their successful application to a wide range of downstream tasks. In this work, we show that CLIP possesses a similar ability for analogical reasoning in its latent space, and we propose a novel strategy to extract analogies between pairs of images in the CLIP space. We compute the difference vectors of all pairs of images belonging to the same class in the CLIP space, and employ k-means clustering to group these difference vectors into clusters irrespective of class. This procedure yields cluster centroids representative of class-agnostic semantic analogies between images. Through extensive analysis, we show that the property of drawing analogies between images also exists in the CLIP space, and that the learned clusters are human-interpretable through visualisation.
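
A minimal sketch of the pipeline the abstract describes (CLIP image embeddings, within-class difference vectors, pooled k-means) is given below. It is an illustration rather than the authors' implementation (see the repository linked in the Notes): the ViT-B/32 backbone, the L2 normalisation, the use of ordered pairs, the cluster count of 50, scikit-learn's MiniBatchKMeans, and the placeholder file paths are all assumptions.

```python
import itertools

import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import MiniBatchKMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone is an assumption


def embed(paths):
    """Encode image files into L2-normalised CLIP image features."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch.to(device)).float()
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()


# Placeholder data: substitute any labelled image collection.
images_by_class = {
    "lorikeet": ["lorikeet_0.jpg", "lorikeet_1.jpg", "lorikeet_2.jpg"],
    "golden_retriever": ["retriever_0.jpg", "retriever_1.jpg"],
}

# Step 1: a difference vector for every ordered pair of same-class images
# (keeping both directions of each pair is a simplifying choice here).
diffs = []
for paths in images_by_class.values():
    feats = embed(paths)
    for i, j in itertools.permutations(range(len(feats)), 2):
        diffs.append(feats[i] - feats[j])
diffs = np.stack(diffs)

# Step 2: pool the difference vectors across classes and cluster them;
# each centroid is a candidate class-agnostic analogy direction.
n_clusters = min(50, len(diffs))  # 50 is assumed; the toy data above yields few pairs
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(diffs)
centroids = kmeans.cluster_centers_
```

A learned centroid c can then be probed in the usual parallelogram fashion: for a query embedding f(C) from some other class, retrieve the images whose embeddings are nearest to f(C) + c. If the cluster encodes a coherent analogy A:B::C:D, the retrieved images D should relate to C as B relates to A.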

Notes

  1. The code is available at https://github.com/Sxing2/CLIP-Analogy.

Author information

Corresponding author

Correspondence to Songlong Xing.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6903 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xing, S., Peruzzo, E., Sangineto, E., Sebe, N. (2025). From One to Many Lorikeets: Discovering Image Analogies in the CLIP Space. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15309. Springer, Cham. https://doi.org/10.1007/978-3-031-78189-6_25

  • DOI: https://doi.org/10.1007/978-3-031-78189-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78188-9

  • Online ISBN: 978-3-031-78189-6

  • eBook Packages: Computer Science, Computer Science (R0)
