From One to Many Lorikeets: Discovering Image Analogies in the CLIP Space

  • Conference paper
  • Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15309)

Abstract

Drawing analogies between two pairs of entities in the form A:B::C:D (i.e. A is to B as C is to D) is a hallmark of human intelligence, as evidenced by decades of findings in cognitive science. In recent years, this property has been observed well beyond cognitive science; notable examples are the word2vec and GloVe models in natural language processing. Recent research in computer vision has likewise found analogical structure in the feature space of pretrained ConvNet feature extractors. However, analogy mining in the semantic space of recent strong foundation models such as CLIP remains understudied, despite their successful application to a wide range of downstream tasks. In this work, we show that CLIP possesses a similar ability for analogical reasoning in its latent space, and we propose a novel strategy to extract analogies between pairs of images in the CLIP space. We compute the difference vectors of all pairs of images belonging to the same class in the CLIP space, and employ k-means clustering to group these difference vectors into clusters irrespective of class. This procedure yields cluster centroids representative of class-agnostic semantic analogies between images. Through extensive analysis, we show that the property of drawing analogies between images also exists in the CLIP space, and that the learned clusters are human-interpretable through visualisation.
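
A minimal sketch of the pipeline the abstract describes (CLIP image embeddings, within-class difference vectors, pooled k-means) is given below. It is an illustration rather than the authors' implementation (see the repository linked in the Notes): the ViT-B/32 backbone, the L2 normalisation, the use of ordered pairs, the cluster count of 50, scikit-learn's MiniBatchKMeans, and the placeholder file paths are all assumptions.

```python
import itertools

import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import MiniBatchKMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone is an assumption


def embed(paths):
    """Encode image files into L2-normalised CLIP image features."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch.to(device)).float()
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()


# Placeholder data: substitute any labelled image collection.
images_by_class = {
    "lorikeet": ["lorikeet_0.jpg", "lorikeet_1.jpg", "lorikeet_2.jpg"],
    "golden_retriever": ["retriever_0.jpg", "retriever_1.jpg"],
}

# Step 1: a difference vector for every ordered pair of same-class images
# (keeping both directions of each pair is a simplifying choice here).
diffs = []
for paths in images_by_class.values():
    feats = embed(paths)
    for i, j in itertools.permutations(range(len(feats)), 2):
        diffs.append(feats[i] - feats[j])
diffs = np.stack(diffs)

# Step 2: pool the difference vectors across classes and cluster them;
# each centroid is a candidate class-agnostic analogy direction.
n_clusters = min(50, len(diffs))  # 50 is assumed; the toy data above yields few pairs
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(diffs)
centroids = kmeans.cluster_centers_
```

A learned centroid c can then be probed in the usual parallelogram fashion: for a query embedding f(C) from some other class, retrieve the images whose embeddings are nearest to f(C) + c. If the cluster encodes a coherent analogy A:B::C:D, the retrieved images D should relate to C as B relates to A.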

Notes

  1. The code is available at https://github.com/Sxing2/CLIP-Analogy.

Author information

Corresponding author

Correspondence to Songlong Xing.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6903 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xing, S., Peruzzo, E., Sangineto, E., Sebe, N. (2025). From One to Many Lorikeets: Discovering Image Analogies in the CLIP Space. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15309. Springer, Cham. https://doi.org/10.1007/978-3-031-78189-6_25

  • DOI: https://doi.org/10.1007/978-3-031-78189-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78188-9

  • Online ISBN: 978-3-031-78189-6

  • eBook Packages: Computer Science, Computer Science (R0)
