
ConceptExpress: Harnessing Diffusion Models for Single-Image Unsupervised Concept Extraction

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task. Our code and data are available at: https://github.com/haoosz/ConceptExpress.
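
For a concrete picture of the two stages summarized in the abstract, the following is a minimal sketch, not the authors' implementation: it assumes a diffusers-style UNet and noise scheduler and a precomputed self-attention map, clusters the attention rows into concept masks with plain k-means (a stand-in for the paper's clustering step), and optimizes one learnable conceptual token per mask with a masked denoising loss. All function names, shapes, and the way the token is injected are illustrative assumptions.

```python
# A minimal, hypothetical sketch of the two stages summarized in the abstract.
# NOT the authors' implementation: the clustering choice (plain k-means), the
# token injection, and all tensor shapes are illustrative assumptions. `unet`
# and `scheduler` are assumed to follow a diffusers-style interface
# (UNet2DConditionModel / DDPMScheduler).
import torch
import torch.nn.functional as F


def localize_concepts(self_attn: torch.Tensor, num_concepts: int, iters: int = 20) -> torch.Tensor:
    """Group spatial positions of a diffusion self-attention map into concept masks.

    self_attn: (HW, HW) tensor; row i is position i's attention over all positions.
    Returns binary masks of shape (num_concepts, HW).
    """
    feats = F.normalize(self_attn, dim=-1)  # treat attention rows as per-position features
    centers = feats[torch.randperm(feats.size(0))[:num_concepts]].clone()
    for _ in range(iters):  # plain k-means, a stand-in for the paper's clustering step
        assign = torch.cdist(feats, centers).argmin(dim=1)
        for k in range(num_concepts):
            if (assign == k).any():
                centers[k] = feats[assign == k].mean(dim=0)
    return torch.stack([(assign == k).float() for k in range(num_concepts)])


def concept_token_loss(unet, scheduler, latents, prompt_embeds, concept_tokens, masks, t):
    """One concept-wise optimization step: each learnable conceptual token is
    trained to reconstruct only the image region covered by its own mask."""
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    h, w = latents.shape[-2:]  # assumes the attention resolution matches the latent resolution
    loss = latents.new_zeros(())
    for token, mask in zip(concept_tokens, masks):
        # Simplification: append the token embedding to the encoded prompt.
        # The actual method binds each conceptual token to a placeholder word
        # in the prompt before text encoding.
        cond = torch.cat([prompt_embeds, token[None, None, :]], dim=1)
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        m = mask.view(1, 1, h, w)
        loss = loss + F.mse_loss(pred * m, noise * m)
    return loss
```

Under this reading, the masks from the first stage gate the denoising loss of the second stage, so each token only receives gradients from its own concept's region; the full pipeline additionally handles multiple attention resolutions and the token-concept lookup association, which this sketch omits.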

Notes

  1. 96 is considerably larger than the dataset sizes used in previous works, such as 30 in DreamBooth [55], 50 in Break-A-Scene [2], and 10 in DisenDiff [83].

  2. https://unsplash.com/.

References

  1. Abdal, R., Zhu, P., Femiani, J., Mitra, N., Wonka, P.: CLIP2StyleGAN: unsupervised extraction of StyleGAN edit directions. In: ACM SIGGRAPH (2022)
  2. Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-scene: extracting multiple concepts from a single image. In: SIGGRAPH Asia (2023)
  3. Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. In: ACM SIGGRAPH (2023)
  4. Avrahami, O., et al.: SpaText: spatio-textual representation for controllable image generation. In: CVPR (2023)
  5. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR (2022)
  6. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: fusing diffusion paths for controlled image generation. In: ICML (2023)
  7. Baranchuk, D., Voynov, A., Rubachev, I., Khrulkov, V., Babenko, A.: Label-efficient semantic segmentation with diffusion models. In: ICLR (2022)
  8. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR (2019)
  9. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
  10. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  11. Chefer, H., et al.: The hidden language of diffusion models. arXiv preprint arXiv:2306.00966 (2023)
  12. Chen, W., et al.: Subject-driven text-to-image generation via apprenticeship learning. In: NeurIPS (2023)
  13. Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: ICLR (2022)
  14. Crowson, K., et al.: VQGAN-CLIP: open domain image generation and editing with natural language guidance. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13697, pp. 88–105. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19836-6_6
  15. Du, Y., Li, S., Mordatch, I.: Compositional visual generation with energy based models. In: NeurIPS (2020)
  16. Gal, R., et al.: An image is worth one word: personalizing text-to-image generation using textual inversion. In: ICLR (2023)
  17. Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder-based domain tuning for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228 (2023)
  18. Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Trans. Graph. (TOG) (2022)
  19. Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: ICCV (2023)
  20. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014)
  21. Hao, S., Han, K., Zhao, S., Wong, K.Y.K.: ViCo: detail-preserving visual condition for personalized text-to-image generation. arXiv preprint arXiv:2306.00971 (2023)
  22. Hedlin, E., Sharma, G., Mahajan, S., Isack, H., Kar, A., Tagliasacchi, A., Yi, K.M.: Unsupervised semantic correspondence using stable diffusion. arXiv preprint arXiv:2305.15581 (2023)
  23. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
  24. Ho, J., Salimans, T., Gritsenko, A.A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. In: NeurIPS (2022)
  25. Jia, X., et al.: Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642 (2023)
  26. Jin, C., Tanno, R., Saseendran, A., Diethe, T., Teare, P.: An image is worth multiple words: learning object level concepts using multi-concept prompt learning. arXiv preprint arXiv:2310.12274 (2023)
  27. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
  28. Karazija, L., Laina, I., Vedaldi, A., Rupprecht, C.: Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316 (2023)
  29. Karras, T., et al.: Alias-free generative adversarial networks. In: NeurIPS (2021)
  30. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR (2019)
  31. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: CVPR (2020)
  32. Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)
  33. Kirillov, A., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  34. Kuhn, H.W.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. (1955)
  35. Kumari, N., Zhang, B., Wang, S.Y., Shechtman, E., Zhang, R., Zhu, J.Y.: Ablating concepts in text-to-image diffusion models. In: ICCV (2023)
  36. Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: CVPR (2023)
  37. Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)
  38. Li, D., Li, J., Hoi, S.C.: BLIP-Diffusion: pre-trained subject representation for controllable text-to-image generation and editing. In: NeurIPS (2023)
  39. Li, X., Lu, J., Han, K., Prisacariu, V.: SD4Match: learning to prompt stable diffusion model for semantic matching. arXiv preprint arXiv:2310.17569 (2023)
  40. Liu, N., Du, Y., Li, S., Tenenbaum, J.B., Torralba, A.: Unsupervised compositional concepts discovery with text-to-image generative models. In: ICCV (2023)
  41. Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: inpainting using denoising diffusion probabilistic models. In: CVPR (2022)
  42. Ma, Y., Yang, H., Wang, W., Fu, J., Liu, J.: Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint arXiv:2303.09319 (2023)
  43. Molad, E., et al.: Dreamix: video diffusion models are general video editors. arXiv preprint arXiv:2302.01329 (2023)
  44. Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., Zuo, W.: Ref-Diff: zero-shot referring image segmentation with generative models. arXiv preprint arXiv:2308.16777 (2023)
  45. Nichol, A.Q., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
  46. Patashnik, O., Garibi, D., Azuri, I., Averbuch-Elor, H., Cohen-Or, D.: Localizing object-level shape variations with text-to-image diffusion models. In: ICCV (2023)
  47. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: StyleCLIP: text-driven manipulation of StyleGAN imagery. In: ICCV (2021)
  48. Qiu, Z., et al.: Controlling text-to-image diffusion by orthogonal finetuning. In: NeurIPS (2023)
  49. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  50. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
  51. Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
  52. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML (2016)
  53. Richardson, E., Goldberg, K., Alaluf, Y., Cohen-Or, D.: ConceptLab: creative generation using diffusion prior constraints. arXiv preprint arXiv:2308.02669 (2023)
  54. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  55. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
  56. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
  57. Sarfraz, S., Sharma, V., Stiefelhagen, R.: Efficient parameter-free clustering using first neighbor relations. In: CVPR (2019)
  58. Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS (2022)
  59. Shi, J., Xiong, W., Lin, Z., Jung, H.J.: InstantBooth: personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023)
  60. Singer, U., et al.: Make-A-Video: text-to-video generation without text-video data. In: ICLR (2022)
  61. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  62. Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: ACL (2023)
  63. Tao, M., Tang, H., Wu, F., Jing, X.Y., Bao, B.K., Xu, C.: DF-GAN: a simple and effective baseline for text-to-image synthesis. In: CVPR (2022)
  64. Tewel, Y., Gal, R., Chechik, G., Atzmon, Y.: Key-locked rank one editing for text-to-image personalization. In: ACM SIGGRAPH (2023)
  65. Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse, attend, and segment: unsupervised zero-shot segmentation using stable diffusion. arXiv preprint arXiv:2308.12469 (2023)
  66. Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: CVPR (2023)
  67. Vinker, Y., Voynov, A., Cohen-Or, D., Shamir, A.: Concept decomposition for visual exploration and inspiration. arXiv preprint arXiv:2305.18203 (2023)
  68. Wang, J., et al.: Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773 (2023)
  69. Wang, S., et al.: Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. In: CVPR (2023)
  70. Wang, X., Girdhar, R., Yu, S.X., Misra, I.: Cut and learn for unsupervised object detection and instance segmentation. In: CVPR (2023)
  71. Wang, Z., Gui, L., Negrea, J., Veitch, V.: Concept algebra for text-controlled vision models. arXiv preprint arXiv:2302.03693 (2023)
  72. Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: ELITE: encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023)
  73. Wu, J.Z., et al.: Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565 (2022)
  74. Xia, W., Yang, Y., Xue, J.H., Wu, B.: TediGAN: text-guided diverse face image generation and manipulation. In: CVPR (2021)
  75. Xiao, C., Yang, Q., Zhou, F., Zhang, C.: From text to mask: localizing entities using the attention of text-to-image diffusion models. arXiv preprint arXiv:2309.04109 (2023)
  76. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR (2018)
  77. Ye, H., Yang, X., Takac, M., Sunderraman, R., Ji, S.: Improving text-to-image synthesis using contrastive learning. arXiv preprint arXiv:2107.02423 (2021)
  78. Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res. (2022)
  79. Zhang, H., Koh, J.Y., Baldridge, J., Lee, H., Yang, Y.: Cross-modal contrastive learning for text-to-image generation. In: CVPR (2021)
  80. Zhang, J., et al.: A tale of two features: stable diffusion complements DINO for zero-shot semantic correspondence. In: NeurIPS (2023)
  81. Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)
  82. Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: ControlVideo: training-free controllable text-to-video generation. In: ICLR (2024)
  83. Zhang, Y., Yang, M., Zhou, Q., Wang, Z.: Attention calibration for disentangled text-to-image personalization. In: CVPR (2024)
  84. Zhao, S., et al.: Uni-ControlNet: all-in-one control to text-to-image diffusion models. In: NeurIPS (2023)
  85. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: CVPR (2019)

Acknowledgement

This work is partially supported by the Hong Kong Research Grants Council - General Research Fund (Grant No.: 17211024).

Author information


Corresponding authors

Correspondence to Kai Han or Kwan-Yee K. Wong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12764 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hao, S., Han, K., Lv, Z., Zhao, S., Wong, K.Y.K. (2025). ConceptExpress: Harnessing Diffusion Models for Single-Image Unsupervised Concept Extraction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15117. Springer, Cham. https://doi.org/10.1007/978-3-031-73202-7_13

  • DOI: https://doi.org/10.1007/978-3-031-73202-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73201-0

  • Online ISBN: 978-3-031-73202-7

  • eBook Packages: Computer Science, Computer Science (R0)
