DOI: 10.1145/3607827.3616838

SAT: Self-Attention Control for Diffusion Models Training

Published: 29 October 2023

ABSTRACT

Recent text-to-image diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, a persistent challenge lies in the generation of detailed images, especially human-related images, which often exhibit distorted faces and eyes. Existing approaches to this issue either rely on more specific yet lengthy prompts or apply restoration tools directly to the generated image. In addition, several studies have shown that attention maps can enhance diffusion models' stability by guiding intermediate samples during the inference process. In this paper, we propose a novel training strategy (SAT) that improves sample quality during the training process. As a straightforward first step, we introduce blur guidance to refine intermediate samples, enabling diffusion models to produce higher-quality outputs with a moderate ratio of control. Building on this, SAT leverages the intermediate attention maps of diffusion models to further improve training sample quality. Specifically, SAT adversarially blurs only the regions that the diffusion model attends to and uses them to guide the model during training. We examine and compare both cross-attention mask control (CAC) and self-attention mask control (SAC) based on Stable Diffusion (SD) v1.5, and our results show that our method under SAC (i.e., SAT) improves the performance of Stable Diffusion.
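
To make the mechanism concrete, the sketch below shows one plausible reading of self-attention mask control: a binary mask is derived from a U-Net block's self-attention maps, and Gaussian blur is applied only to the attended regions of an intermediate sample. This is a minimal PyTorch sketch under assumed shapes and thresholds; the function names (attention_mask, blur_attended_regions), the mean-based cutoff, and the blur parameters are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def attention_mask(attn: torch.Tensor, spatial_size, threshold: float = 1.0) -> torch.Tensor:
    # attn: (batch, heads, tokens, tokens) self-attention probabilities from a
    # U-Net block, where tokens == h * w for that block's feature map.
    h, w = spatial_size
    # Average over heads, then over query positions, to get the attention
    # mass each spatial token receives.
    saliency = attn.mean(dim=1).mean(dim=1)        # (batch, tokens)
    saliency = saliency.view(-1, 1, h, w)          # (batch, 1, h, w)
    # Mark tokens that receive above-average attention (assumed cutoff rule).
    cutoff = threshold * saliency.mean(dim=(2, 3), keepdim=True)
    return (saliency > cutoff).float()

def blur_attended_regions(x: torch.Tensor, mask: torch.Tensor,
                          kernel_size: int = 9, sigma: float = 3.0) -> torch.Tensor:
    # Upsample the token-level mask to the sample resolution, blur the whole
    # sample once, then blend the blurred copy in only where the mask is on.
    mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
    x_blur = gaussian_blur(x, kernel_size=[kernel_size, kernel_size],
                           sigma=[sigma, sigma])
    return mask * x_blur + (1.0 - mask) * x

# Example: degrade latents x of shape (B, 4, 64, 64) using a 16x16
# self-attention map, i.e. attn of shape (B, heads, 256, 256):
#   x_deg = blur_attended_regions(x, attention_mask(attn, (16, 16)))

In a training step, the degraded sample x_deg could then steer the model with a moderate control ratio, e.g. x + s * (x - x_deg) for a small scale s, analogous to how self-attention guidance perturbs attended regions at inference; the exact ratio and schedule SAT uses are not specified in the abstract.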


Published in

LGM3A '23: Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications
November 2023, 84 pages
ISBN: 9798400702839
DOI: 10.1145/3607827
Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



      Qualifiers

      • research-article

