Abstract
Text-to-Image (T2I) generation has made significant advances with the advent of diffusion models, which exhibit a remarkable ability to produce images from textual prompts. Current T2I models allow users to specify object colors using linguistic color names; however, these labels cover broad color ranges, making precise color matching difficult. To tackle this task, which we call color prompt learning, we propose learning specific color prompts tailored to user-selected colors. Existing T2I personalization methods tend to result in color-shape entanglement. To overcome this, we generate several basic geometric objects in the target color, enabling color and shape disentanglement during color prompt learning. Our method, denoted ColorPeel, successfully assists T2I models in peeling off novel color prompts from these colored shapes. In our experiments, we demonstrate the efficacy of ColorPeel in achieving precise color generation with T2I models. Furthermore, we generalize ColorPeel to learn abstract attribute concepts such as textures and materials. Our findings represent a significant step towards improving the precision and versatility of T2I models, offering new opportunities for creative applications and design tasks. Our project is available at https://moatifbutt.github.io/colorpeel/.
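The abstract's core idea is to learn a color prompt from a small set of basic geometric shapes rendered in the exact user-selected color, so that color and shape can be disentangled. The snippet below is a minimal sketch of how such training images might be constructed; the shape names, placeholder tokens (e.g. "<c1>"), and file layout are illustrative assumptions and not the authors' released code.

```python
# Hedged sketch: render basic geometric shapes in an exact user-chosen RGB color,
# as a starting point for color prompt learning with color-shape disentanglement.
from PIL import Image, ImageDraw


def render_shape(shape: str, rgb: tuple, size: int = 512, bg=(255, 255, 255)) -> Image.Image:
    """Render a single filled shape in the exact target RGB color on a plain background."""
    img = Image.new("RGB", (size, size), bg)
    draw = ImageDraw.Draw(img)
    pad = size // 5
    box = (pad, pad, size - pad, size - pad)
    if shape == "circle":
        draw.ellipse(box, fill=rgb)
    elif shape == "square":
        draw.rectangle(box, fill=rgb)
    elif shape == "triangle":
        draw.polygon([(size // 2, pad), (pad, size - pad), (size - pad, size - pad)], fill=rgb)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return img


if __name__ == "__main__":
    target_rgb = (201, 42, 67)  # a user-selected color with no precise linguistic name
    for shape in ["circle", "square", "triangle"]:
        render_shape(shape, target_rgb).save(f"colorpeel_{shape}.png")
        # Each image would be paired with a prompt such as "a photo of a <c1> circle",
        # where <c1> is a hypothetical placeholder token whose embedding is optimized,
        # so the learned prompt captures the color rather than any single shape.
        print(f"saved colorpeel_{shape}.png")
```

Because the same color appears across several shapes, an optimized color token cannot latch onto geometry, which is the intuition behind the disentanglement described above.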
Acknowledgments
We acknowledge projects TED2021-132513B-I00, PID2021-128178OB-I00 and PID2022-143257NB-I00, financed by MCIN/AEI/10.13039/501100011033 and FSE+ by the European Union NextGenerationEU/PRTR, by ERDF A Way of Making Europe, by the Departament de Recerca i Universitats of the Generalitat de Catalunya under reference 2021SGR01499, and by the Generalitat de Catalunya CERCA Program.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Butt, M.A., Wang, K., Vazquez-Corral, J., van de Weijer, J. (2025). ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_26
DOI: https://doi.org/10.1007/978-3-031-72667-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science (R0)