
ConceptExpress: Harnessing Diffusion Models for Single-Image Unsupervised Concept Extraction

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task. Our code and data are available at: https://github.com/haoosz/ConceptExpress.
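
For a concrete picture of the two stages summarized in the abstract, the following is a minimal sketch, not the authors' implementation: it assumes a diffusers-style UNet and noise scheduler and a precomputed self-attention map, clusters the attention rows into concept masks with plain k-means (a stand-in for the paper's clustering step), and optimizes one learnable conceptual token per mask with a masked denoising loss. All function names, shapes, and the way the token is injected are illustrative assumptions.

```python
# A minimal, hypothetical sketch of the two stages summarized in the abstract.
# NOT the authors' implementation: the clustering choice (plain k-means), the
# token injection, and all tensor shapes are illustrative assumptions. `unet`
# and `scheduler` are assumed to follow a diffusers-style interface
# (UNet2DConditionModel / DDPMScheduler).
import torch
import torch.nn.functional as F


def localize_concepts(self_attn: torch.Tensor, num_concepts: int, iters: int = 20) -> torch.Tensor:
    """Group spatial positions of a diffusion self-attention map into concept masks.

    self_attn: (HW, HW) tensor; row i is position i's attention over all positions.
    Returns binary masks of shape (num_concepts, HW).
    """
    feats = F.normalize(self_attn, dim=-1)  # treat attention rows as per-position features
    centers = feats[torch.randperm(feats.size(0))[:num_concepts]].clone()
    for _ in range(iters):  # plain k-means, a stand-in for the paper's clustering step
        assign = torch.cdist(feats, centers).argmin(dim=1)
        for k in range(num_concepts):
            if (assign == k).any():
                centers[k] = feats[assign == k].mean(dim=0)
    return torch.stack([(assign == k).float() for k in range(num_concepts)])


def concept_token_loss(unet, scheduler, latents, prompt_embeds, concept_tokens, masks, t):
    """One concept-wise optimization step: each learnable conceptual token is
    trained to reconstruct only the image region covered by its own mask."""
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    h, w = latents.shape[-2:]  # assumes the attention resolution matches the latent resolution
    loss = latents.new_zeros(())
    for token, mask in zip(concept_tokens, masks):
        # Simplification: append the token embedding to the encoded prompt.
        # The actual method binds each conceptual token to a placeholder word
        # in the prompt before text encoding.
        cond = torch.cat([prompt_embeds, token[None, None, :]], dim=1)
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        m = mask.view(1, 1, h, w)
        loss = loss + F.mse_loss(pred * m, noise * m)
    return loss
```

Under this reading, the masks from the first stage gate the denoising loss of the second stage, so each token only receives gradients from its own concept's region; the full pipeline additionally handles multiple attention resolutions and the token-concept lookup association, which this sketch omits.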

Notes

  1. 96 is considerably larger than the dataset sizes used in previous works, such as 30 in DreamBooth [55], 50 in Break-A-Scene [2], and 10 in DisenDiff [83].

  2. https://unsplash.com/.

References

  1. Abdal, R., Zhu, P., Femiani, J., Mitra, N., Wonka, P.: CLIP2StyleGAN: unsupervised extraction of StyleGAN edit directions. In: ACM SIGGRAPH (2022)
  2. Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-scene: extracting multiple concepts from a single image. In: SIGGRAPH Asia (2023)
  3. Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. In: ACM SIGGRAPH (2023)
  4. Avrahami, O., et al.: SpaText: spatio-textual representation for controllable image generation. In: CVPR (2023)
  5. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR (2022)
  6. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: fusing diffusion paths for controlled image generation. In: ICML (2023)
  7. Baranchuk, D., Voynov, A., Rubachev, I., Khrulkov, V., Babenko, A.: Label-efficient semantic segmentation with diffusion models. In: ICLR (2022)
  8. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR (2019)
  9. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
  10. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  11. Chefer, H., et al.: The hidden language of diffusion models. arXiv preprint arXiv:2306.00966 (2023)
  12. Chen, W., et al.: Subject-driven text-to-image generation via apprenticeship learning. In: NeurIPS (2023)
  13. Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: ICLR (2022)
  14. Crowson, K., et al.: VQGAN-CLIP: open domain image generation and editing with natural language guidance. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13697, pp. 88–105. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19836-6_6
  15. Du, Y., Li, S., Mordatch, I.: Compositional visual generation with energy based models. In: NeurIPS (2020)
  16. Gal, R., et al.: An image is worth one word: personalizing text-to-image generation using textual inversion. In: ICLR (2023)
  17. Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder-based domain tuning for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228 (2023)
  18. Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Trans. Graph. (TOG) (2022)
  19. Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: ICCV (2023)
  20. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014)
  21. Hao, S., Han, K., Zhao, S., Wong, K.Y.K.: ViCo: detail-preserving visual condition for personalized text-to-image generation. arXiv preprint arXiv:2306.00971 (2023)
  22. Hedlin, E., Sharma, G., Mahajan, S., Isack, H., Kar, A., Tagliasacchi, A., Yi, K.M.: Unsupervised semantic correspondence using stable diffusion. arXiv preprint arXiv:2305.15581 (2023)
  23. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
  24. Ho, J., Salimans, T., Gritsenko, A.A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. In: NeurIPS (2022)
  25. Jia, X., et al.: Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642 (2023)
  26. Jin, C., Tanno, R., Saseendran, A., Diethe, T., Teare, P.: An image is worth multiple words: learning object level concepts using multi-concept prompt learning. arXiv preprint arXiv:2310.12274 (2023)
  27. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
  28. Karazija, L., Laina, I., Vedaldi, A., Rupprecht, C.: Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316 (2023)
  29. Karras, T., et al.: Alias-free generative adversarial networks. In: NeurIPS (2021)
  30. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR (2019)
  31. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: CVPR (2020)
  32. Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)
  33. Kirillov, A., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  34. Kuhn, H.W.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. (1955)
  35. Kumari, N., Zhang, B., Wang, S.Y., Shechtman, E., Zhang, R., Zhu, J.Y.: Ablating concepts in text-to-image diffusion models. In: ICCV (2023)
  36. Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: CVPR (2023)
  37. Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)
  38. Li, D., Li, J., Hoi, S.C.: BLIP-Diffusion: pre-trained subject representation for controllable text-to-image generation and editing. In: NeurIPS (2023)
  39. Li, X., Lu, J., Han, K., Prisacariu, V.: SD4Match: learning to prompt stable diffusion model for semantic matching. arXiv preprint arXiv:2310.17569 (2023)
  40. Liu, N., Du, Y., Li, S., Tenenbaum, J.B., Torralba, A.: Unsupervised compositional concepts discovery with text-to-image generative models. In: ICCV (2023)
  41. Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: inpainting using denoising diffusion probabilistic models. In: CVPR (2022)
  42. Ma, Y., Yang, H., Wang, W., Fu, J., Liu, J.: Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint arXiv:2303.09319 (2023)
  43. Molad, E., et al.: Dreamix: video diffusion models are general video editors. arXiv preprint arXiv:2302.01329 (2023)
  44. Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., Zuo, W.: Ref-Diff: zero-shot referring image segmentation with generative models. arXiv preprint arXiv:2308.16777 (2023)
  45. Nichol, A.Q., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
  46. Patashnik, O., Garibi, D., Azuri, I., Averbuch-Elor, H., Cohen-Or, D.: Localizing object-level shape variations with text-to-image diffusion models. In: ICCV (2023)
  47. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: StyleCLIP: text-driven manipulation of StyleGAN imagery. In: ICCV (2021)
  48. Qiu, Z., et al.: Controlling text-to-image diffusion by orthogonal finetuning. In: NeurIPS (2023)
  49. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  50. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
  51. Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
  52. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML (2016)
  53. Richardson, E., Goldberg, K., Alaluf, Y., Cohen-Or, D.: ConceptLab: creative generation using diffusion prior constraints. arXiv preprint arXiv:2308.02669 (2023)
  54. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  55. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
  56. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
  57. Sarfraz, S., Sharma, V., Stiefelhagen, R.: Efficient parameter-free clustering using first neighbor relations. In: CVPR (2019)
  58. Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS (2022)
  59. Shi, J., Xiong, W., Lin, Z., Jung, H.J.: InstantBooth: personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023)
  60. Singer, U., et al.: Make-A-Video: text-to-video generation without text-video data. In: ICLR (2022)
  61. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  62. Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: ACL (2023)
  63. Tao, M., Tang, H., Wu, F., Jing, X.Y., Bao, B.K., Xu, C.: DF-GAN: a simple and effective baseline for text-to-image synthesis. In: CVPR (2022)
  64. Tewel, Y., Gal, R., Chechik, G., Atzmon, Y.: Key-locked rank one editing for text-to-image personalization. In: ACM SIGGRAPH (2023)
  65. Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse, attend, and segment: unsupervised zero-shot segmentation using stable diffusion. arXiv preprint arXiv:2308.12469 (2023)
  66. Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: CVPR (2023)
  67. Vinker, Y., Voynov, A., Cohen-Or, D., Shamir, A.: Concept decomposition for visual exploration and inspiration. arXiv preprint arXiv:2305.18203 (2023)
  68. Wang, J., et al.: Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773 (2023)
  69. Wang, S., et al.: Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. In: CVPR (2023)
  70. Wang, X., Girdhar, R., Yu, S.X., Misra, I.: Cut and learn for unsupervised object detection and instance segmentation. In: CVPR (2023)
  71. Wang, Z., Gui, L., Negrea, J., Veitch, V.: Concept algebra for text-controlled vision models. arXiv preprint arXiv:2302.03693 (2023)
  72. Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: ELITE: encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023)
  73. Wu, J.Z., et al.: Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565 (2022)
  74. Xia, W., Yang, Y., Xue, J.H., Wu, B.: TediGAN: text-guided diverse face image generation and manipulation. In: CVPR (2021)
  75. Xiao, C., Yang, Q., Zhou, F., Zhang, C.: From text to mask: localizing entities using the attention of text-to-image diffusion models. arXiv preprint arXiv:2309.04109 (2023)
  76. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR (2018)
  77. Ye, H., Yang, X., Takac, M., Sunderraman, R., Ji, S.: Improving text-to-image synthesis using contrastive learning. arXiv preprint arXiv:2107.02423 (2021)
  78. Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res. (2022)
  79. Zhang, H., Koh, J.Y., Baldridge, J., Lee, H., Yang, Y.: Cross-modal contrastive learning for text-to-image generation. In: CVPR (2021)
  80. Zhang, J., et al.: A tale of two features: stable diffusion complements DINO for zero-shot semantic correspondence. In: NeurIPS (2023)
  81. Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)
  82. Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: ControlVideo: training-free controllable text-to-video generation. In: ICLR (2024)
  83. Zhang, Y., Yang, M., Zhou, Q., Wang, Z.: Attention calibration for disentangled text-to-image personalization. In: CVPR (2024)
  84. Zhao, S., et al.: Uni-ControlNet: all-in-one control to text-to-image diffusion models. In: NeurIPS (2023)
  85. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: CVPR (2019)

Acknowledgement

This work is partially supported by the Hong Kong Research Grants Council - General Research Fund (Grant No.: 17211024).

Author information


Corresponding authors

Correspondence to Kai Han or Kwan-Yee K. Wong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12764 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hao, S., Han, K., Lv, Z., Zhao, S., Wong, K.Y.K. (2025). ConceptExpress: Harnessing Diffusion Models for Single-Image Unsupervised Concept Extraction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15117. Springer, Cham. https://doi.org/10.1007/978-3-031-73202-7_13

  • DOI: https://doi.org/10.1007/978-3-031-73202-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73201-0

  • Online ISBN: 978-3-031-73202-7

  • eBook Packages: Computer Science, Computer Science (R0)
