Abstract
Drawing analogies between two pairs of entities in the form A:B::C:D (i.e. A is to B as C is to D) is a hallmark of human intelligence, as evidenced by decades of findings in cognitive science. In recent years, this property has been observed far beyond cognitive science; notable examples are the word2vec and GloVe models in natural language processing. Recent research in computer vision has also found analogical structure in the feature space of pretrained ConvNet feature extractors. However, analogy mining in the semantic space of recent strong foundation models such as CLIP remains understudied, despite their successful application to a wide range of downstream tasks. In this work, we show that CLIP exhibits a similar capacity for analogical reasoning in its latent space, and we propose a novel strategy to extract analogies between pairs of images in the CLIP space. We compute the difference vector of every pair of images belonging to the same class in the CLIP space, and employ k-means clustering to group these difference vectors irrespective of their classes. This procedure yields cluster centroids representative of class-agnostic semantic analogies between images. Through extensive analysis, we show that the property of drawing analogies between images also exists in the CLIP space, and that the discovered analogies are interpretable by humans through a visualisation of the learned clusters.
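The pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embedding dimensionality, the number of clusters, the choice to L2-normalise the difference vectors, and the use of scikit-learn's `KMeans` are all assumptions, and the random arrays stand in for actual CLIP image embeddings.

```python
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def intra_class_differences(embeddings_by_class):
    """Collect the normalised difference vector of every ordered pair of
    image embeddings that share a class label."""
    diffs = []
    for embs in embeddings_by_class.values():
        for i, j in permutations(range(len(embs)), 2):
            d = embs[i] - embs[j]
            norm = np.linalg.norm(d)
            if norm > 0:               # skip degenerate (identical) pairs
                diffs.append(d / norm)
    return np.stack(diffs)

# Toy stand-in for CLIP image embeddings: 3 classes, 4 images each, dim 512.
rng = np.random.default_rng(0)
embeddings_by_class = {c: rng.normal(size=(4, 512)) for c in range(3)}

# 3 classes x (4 x 3 ordered pairs) = 36 difference vectors.
diffs = intra_class_differences(embeddings_by_class)

# Cluster all difference vectors together, irrespective of class; the
# centroids are the candidate class-agnostic analogy directions.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(diffs)
centroids = kmeans.cluster_centers_
print(diffs.shape, centroids.shape)
```

In a real run, each cluster would then be inspected by retrieving the image pairs whose difference vectors fall closest to its centroid, which is what makes the discovered analogies human-interpretable.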
Notes
1. The code is available at https://github.com/Sxing2/CLIP-Analogy.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xing, S., Peruzzo, E., Sangineto, E., Sebe, N. (2025). From One to Many Lorikeets: Discovering Image Analogies in the CLIP Space. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15309. Springer, Cham. https://doi.org/10.1007/978-3-031-78189-6_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78188-9
Online ISBN: 978-3-031-78189-6