ABSTRACT
Recent text-to-image diffusion models achieve outstanding performance in generating high-quality images conditioned on textual prompts. However, generating detailed images remains a persistent challenge, especially for human subjects, whose faces and eyes are often distorted. Existing approaches address this issue either by using more specific but lengthy prompts or by applying restoration tools directly to the generated image. In addition, several studies have shown that attention maps can enhance the stability of diffusion models by guiding intermediate samples during inference. In this paper, we propose a novel training strategy (SAT) that improves sample quality during the training process. As a straightforward first step, we introduce blur guidance, which refines intermediate samples and enables diffusion models to produce higher-quality outputs under a moderate guidance ratio. Building on this, SAT leverages the intermediate attention maps of diffusion models to further improve training sample quality. Specifically, SAT adversarially blurs only the regions that the diffusion model attends to and uses them to guide the model during training. We examine and compare cross-attention mask control (CAC) and self-attention mask control (SAC) on Stable Diffusion v1.5, and our results show that our method under SAC (i.e., SAT) improves the performance of Stable Diffusion.
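The core operation described above, blurring only the regions an attention map highlights while leaving the rest of the sample untouched, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function name, the fixed binarization threshold, and the simple 3x3 mean blur (standing in for whatever blur the method actually uses) are all assumptions for demonstration.

```python
import numpy as np

def box_blur(x):
    """Simple 3x3 mean blur with edge padding (a stand-in for a Gaussian blur)."""
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")
    # Average the 9 shifted copies of the padded image.
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def attention_masked_blur(x, attn, threshold=0.5):
    """Blur only the attended regions of an intermediate sample.

    x:    (H, W) intermediate sample (single channel for simplicity)
    attn: (H, W) attention map with values in [0, 1]
    """
    mask = (attn > threshold).astype(x.dtype)      # binarize the attention map
    # Composite: blurred pixels where attended, original pixels elsewhere.
    return mask * box_blur(x) + (1.0 - mask) * x
```

In a training loop, the blurred composite would serve as the degraded counterpart that guides the model, while unattended regions pass through unchanged.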
SAT: Self-Attention Control for Diffusion Models Training