Abstract
Text-to-image diffusion models give users control over the content of generated images, yet generation still fails often enough that users may need to sample dozens of images under the same text prompt before obtaining a satisfying result. We formulate the lottery ticket hypothesis in denoising: randomly initialized Gaussian noise images contain special pixel blocks (winning tickets) that naturally tend to be denoised into specific content, independently of the rest of the image. The generation failures of standard text-to-image synthesis are caused by the gap between the optimal and the actual spatial distribution of winning tickets in the initial noise image. To close this gap, we implement semantic-driven initial image construction, which assembles the initial noise from known winning tickets, one for each concept mentioned in the prompt. We conduct a series of experiments that verify the properties of winning tickets and demonstrate their generalizability across images and prompts. Our results show that aggregating winning tickets into the initial noise image effectively induces the model to generate the specified objects at the corresponding locations.
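The sketch below illustrates the core idea of semantic-driven initial image construction as summarized in the abstract; it is a minimal illustration, not the authors' released implementation. The latent shape, the `winning_tickets` store, the `target_boxes` placements, and the file names in the usage comment are assumptions for illustration, following the common 4x64x64 Stable Diffusion latent layout.

```python
import torch

def build_initial_latent(winning_tickets, target_boxes,
                         shape=(1, 4, 64, 64), seed=0):
    """Assemble an initial noise latent from stored winning-ticket blocks.

    winning_tickets: dict mapping concept name -> noise block tensor (4, h, w)
                     known (empirically) to tend toward that concept
    target_boxes:    dict mapping concept name -> (top, left) placement
    """
    g = torch.Generator().manual_seed(seed)
    latent = torch.randn(shape, generator=g)  # background Gaussian noise
    for concept, block in winning_tickets.items():
        top, left = target_boxes[concept]
        _, h, w = block.shape
        # Overwrite the target region with the concept's winning-ticket noise.
        latent[0, :, top:top + h, left:left + w] = block
    return latent

# Hypothetical usage (file names and coordinates are illustrative):
# tickets = {"dog": torch.load("dog_ticket.pt"), "ball": torch.load("ball_ticket.pt")}
# boxes = {"dog": (8, 8), "ball": (32, 40)}
# latent = build_initial_latent(tickets, boxes)
```

The resulting latent would then be supplied as the starting noise of a text-to-image diffusion sampler (for example, via the `latents` argument of a Stable Diffusion pipeline) together with a prompt mentioning the corresponding concepts.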
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mao, J., Wang, X., Aizawa, K. (2025). The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1