DOI: 10.1145/3607827.3616838

SAT: Self-Attention Control for Diffusion Models Training

Published: 29 October 2023

ABSTRACT

Recent text-to-image diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, a persistent challenge lies in the generation of detailed images, especially human-related images, which often exhibit distorted faces and eyes. Existing approaches to this issue either rely on more specific yet lengthy prompts or apply restoration tools directly to the generated image. In addition, several studies have shown that attention maps can enhance diffusion models' stability by guiding intermediate samples during the inference process. In this paper, we propose a novel training strategy (SAT) that improves sample quality during the training process. As a straightforward first step, we introduce blur guidance to refine intermediate samples, enabling diffusion models to produce higher-quality outputs with a moderate ratio of control. Building on this, SAT leverages the intermediate attention maps of diffusion models to further improve training sample quality. Specifically, SAT adversarially blurs only the regions that the diffusion model attends to and uses them to guide the model during training. We examine and compare both cross-attention mask control (CAC) and self-attention mask control (SAC) based on Stable Diffusion (SD) v1.5, and our results show that our method under SAC (i.e., SAT) improves the performance of Stable Diffusion.
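
To make the mechanism concrete, the sketch below shows one plausible reading of self-attention mask control: a binary mask is derived from a U-Net block's self-attention maps, and Gaussian blur is applied only to the attended regions of an intermediate sample. This is a minimal PyTorch sketch under assumed shapes and thresholds; the function names (attention_mask, blur_attended_regions), the mean-based cutoff, and the blur parameters are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def attention_mask(attn: torch.Tensor, spatial_size, threshold: float = 1.0) -> torch.Tensor:
    # attn: (batch, heads, tokens, tokens) self-attention probabilities from a
    # U-Net block, where tokens == h * w for that block's feature map.
    h, w = spatial_size
    # Average over heads, then over query positions, to get the attention
    # mass each spatial token receives.
    saliency = attn.mean(dim=1).mean(dim=1)        # (batch, tokens)
    saliency = saliency.view(-1, 1, h, w)          # (batch, 1, h, w)
    # Mark tokens that receive above-average attention (assumed cutoff rule).
    cutoff = threshold * saliency.mean(dim=(2, 3), keepdim=True)
    return (saliency > cutoff).float()

def blur_attended_regions(x: torch.Tensor, mask: torch.Tensor,
                          kernel_size: int = 9, sigma: float = 3.0) -> torch.Tensor:
    # Upsample the token-level mask to the sample resolution, blur the whole
    # sample once, then blend the blurred copy in only where the mask is on.
    mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
    x_blur = gaussian_blur(x, kernel_size=[kernel_size, kernel_size],
                           sigma=[sigma, sigma])
    return mask * x_blur + (1.0 - mask) * x

# Example: degrade latents x of shape (B, 4, 64, 64) using a 16x16
# self-attention map, i.e. attn of shape (B, heads, 256, 256):
#   x_deg = blur_attended_regions(x, attention_mask(attn, (16, 16)))

In a training step, the degraded sample x_deg could then steer the model with a moderate control ratio, e.g. x + s * (x - x_deg) for a small scale s, analogous to how self-attention guidance perturbs attended regions at inference; the exact ratio and schedule SAT uses are not specified in the abstract.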


Published in

LGM3A '23: Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications
November 2023, 84 pages
ISBN: 9798400702839
DOI: 10.1145/3607827
Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



      Qualifiers

      • research-article

