Abstract
Text-to-Image (T2I) generation has made significant advances with the advent of diffusion models, which exhibit a remarkable ability to produce images from textual prompts. Current T2I models allow users to specify object colors using linguistic color names; however, these labels cover broad color ranges, making precise color matching difficult. To tackle this task, which we call color prompt learning, we propose learning specific color prompts tailored to user-selected colors. Existing T2I personalization methods tend to result in color-shape entanglement. To overcome this, we generate several basic geometric objects in the target color, enabling color and shape disentanglement during color prompt learning. Our method, denoted ColorPeel, successfully assists T2I models in peeling off novel color prompts from these colored shapes. In our experiments, we demonstrate the efficacy of ColorPeel in achieving precise color generation with T2I models. Furthermore, we generalize ColorPeel to learn abstract attribute concepts such as textures and materials. Our findings represent a significant step towards improving the precision and versatility of T2I models, offering new opportunities for creative applications and design tasks. Our project is available at https://moatifbutt.github.io/colorpeel/.
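The abstract's core idea is to learn a color prompt from a small set of basic geometric shapes rendered in the exact user-selected color, so that color and shape can be disentangled. The snippet below is a minimal sketch of how such training images might be constructed; the shape names, placeholder tokens (e.g. "<c1>"), and file layout are illustrative assumptions and not the authors' released code.

```python
# Hedged sketch: render basic geometric shapes in an exact user-chosen RGB color,
# as a starting point for color prompt learning with color-shape disentanglement.
from PIL import Image, ImageDraw


def render_shape(shape: str, rgb: tuple, size: int = 512, bg=(255, 255, 255)) -> Image.Image:
    """Render a single filled shape in the exact target RGB color on a plain background."""
    img = Image.new("RGB", (size, size), bg)
    draw = ImageDraw.Draw(img)
    pad = size // 5
    box = (pad, pad, size - pad, size - pad)
    if shape == "circle":
        draw.ellipse(box, fill=rgb)
    elif shape == "square":
        draw.rectangle(box, fill=rgb)
    elif shape == "triangle":
        draw.polygon([(size // 2, pad), (pad, size - pad), (size - pad, size - pad)], fill=rgb)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return img


if __name__ == "__main__":
    target_rgb = (201, 42, 67)  # a user-selected color with no precise linguistic name
    for shape in ["circle", "square", "triangle"]:
        render_shape(shape, target_rgb).save(f"colorpeel_{shape}.png")
        # Each image would be paired with a prompt such as "a photo of a <c1> circle",
        # where <c1> is a hypothetical placeholder token whose embedding is optimized,
        # so the learned prompt captures the color rather than any single shape.
        print(f"saved colorpeel_{shape}.png")
```

Because the same color appears across several shapes, an optimized color token cannot latch onto geometry, which is the intuition behind the disentanglement described above.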
Acknowledgments
We acknowledge projects TED2021-132513B-I00, PID2021-128178OB-I00 and PID2022-143257NB-I00, financed by MCIN/AEI/10.13039/501100011033 and FSE+ by the European Union NextGenerationEU/PRTR, by ERDF A Way of Making Europe, by the Departament de Recerca i Universitats of the Generalitat de Catalunya under reference 2021SGR01499, and by the Generalitat de Catalunya CERCA Program.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Butt, M.A., Wang, K., Vazquez-Corral, J., van de Weijer, J. (2025). ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_26
DOI: https://doi.org/10.1007/978-3-031-72667-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science (R0)