ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Text-to-Image (T2I) generation has made significant advancements with the advent of diffusion models, which exhibit remarkable abilities to produce images based on textual prompts. Current T2I models allow users to specify object colors using linguistic color names, but these labels cover broad color ranges, making precise color matching difficult. To tackle this challenging task, which we name color prompt learning, we propose to learn specific color prompts tailored to user-selected colors. Existing T2I personalization methods tend to result in color-shape entanglement. To overcome this, we generate several basic geometric objects in the target color, enabling color and shape disentanglement during color prompt learning. Our method, denoted ColorPeel, successfully assists T2I models in peeling off the novel color prompts from these colored shapes. In our experiments, we demonstrate the efficacy of ColorPeel in achieving precise color generation with T2I models. Furthermore, we generalize ColorPeel to effectively learn abstract attribute concepts, including textures and materials. Our findings represent a significant step towards improving the precision and versatility of T2I models, offering new opportunities for creative applications and design tasks. Our project is available at https://moatifbutt.github.io/colorpeel/.
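To make the disentanglement idea above concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how such training data could be constructed: basic geometric shapes are rendered in the exact user-selected RGB color and paired with prompts containing placeholder color and shape tokens, which a textual-inversion-style prompt learner could then optimize. The token names (<c1>, <s1>, <s2>), the prompt template, and the file layout are illustrative assumptions.

```python
# Hypothetical sketch of ColorPeel-style training-data construction:
# render basic shapes in one target RGB color and pair each image with a
# prompt containing placeholder color/shape tokens. Requires Pillow.
from PIL import Image, ImageDraw

def make_shape_image(shape, rgb, size=512, bg=(255, 255, 255)):
    """Draw a single filled shape in the user-selected color on a plain background."""
    img = Image.new("RGB", (size, size), bg)
    draw = ImageDraw.Draw(img)
    box = (size // 4, size // 4, 3 * size // 4, 3 * size // 4)
    if shape == "circle":
        draw.ellipse(box, fill=rgb)
    elif shape == "square":
        draw.rectangle(box, fill=rgb)
    else:
        raise ValueError(f"unsupported shape: {shape}")
    return img

# Exact user-selected color (a specific RGB triple, not a broad color name).
target_rgb = (178, 34, 52)

# Rendering different shapes in the same color lets the prompt learner
# attribute the shared color to the <c1> token and the varying geometry to
# the per-shape <s*> tokens, disentangling color from shape.
pairs = []
for i, shape in enumerate(["circle", "square"], start=1):
    img = make_shape_image(shape, target_rgb)
    prompt = f"a photo of a <c1> <s{i}>"  # placeholder tokens, illustrative only
    path = f"train_{shape}.png"
    img.save(path)
    pairs.append((path, prompt))

for path, prompt in pairs:
    print(path, "->", prompt)
```

These image-prompt pairs would then be fed to a textual-inversion-style optimizer that learns embeddings for the new tokens; the learned <c1> token can afterwards be reused in arbitrary prompts to request the exact target color.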


Acknowledgments

We acknowledge projects TED2021-132513B-I00, PID2021-128178OB-I00, and PID2022-143257NB-I00, financed by MCIN/AEI/10.13039/501100011033 and FSE+ by the European Union NextGenerationEU/PRTR, by ERDF A Way of Making Europe, by the Departament de Recerca i Universitats of the Generalitat de Catalunya with reference 2021SGR01499, and by the Generalitat de Catalunya CERCA Program.

Author information

Corresponding author

Correspondence to Kai Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 35,326 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Butt, M.A., Wang, K., Vazquez-Corral, J., van de Weijer, J. (2025). ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_26

  • DOI: https://doi.org/10.1007/978-3-031-72667-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72666-8

  • Online ISBN: 978-3-031-72667-5

  • eBook Packages: Computer Science, Computer Science (R0)
