Abstract
Current research on text-to-image generation has reached a level comparable to that of ordinary painters, but it still leaves considerable room for improvement relative to the level of artists. Artist-level paintings typically fuse the features of multiple images of imagery into a single image to express multilevel semantic information. In a pre-experiment, we confirmed this observation and consulted three groups with different levels of art appreciation ability to identify what distinguishes painter-level from artist-level work. These opinions were then used to help an artificial intelligence painting system improve from painter-level to artist-level image generation. Specifically, we propose a text-based multistage guidance method, requiring no further pretraining, that helps the diffusion model move toward multilevel semantic representation in the generated images. Both machine and human evaluations in the experiments validate the effectiveness of the proposed method. Moreover, unlike previous single-stage guidance methods, our method can control the degree to which each image feature of imagery is expressed in the painting by controlling the number of guidance steps assigned to each stage.
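The core idea described above — partitioning the denoising trajectory into stages, each guided by a different text prompt, with the per-stage step count controlling how strongly that stage's imagery appears — can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function names, the toy `denoise_step`, and the example prompts are all hypothetical, and a real system would plug the schedule into an actual diffusion sampler (e.g., a DDIM-style loop).

```python
import numpy as np


def multistage_guidance_schedule(prompts, steps_per_stage, total_steps):
    """Assign a text prompt to each denoising step.

    prompts:         one prompt per stage (e.g., one per piece of imagery).
    steps_per_stage: denoising steps given to each stage; a larger count
                     makes that stage's imagery more prominent in the result.
    """
    assert len(prompts) == len(steps_per_stage)
    assert sum(steps_per_stage) == total_steps
    schedule = []
    for prompt, n in zip(prompts, steps_per_stage):
        schedule.extend([prompt] * n)
    return schedule


def sample_with_multistage_guidance(denoise_step, x_T, schedule):
    """Generic denoising loop that switches the text condition per step.

    denoise_step(x, t, prompt) stands in for one reverse-diffusion step
    conditioned on `prompt`; x_T is the initial noise sample.
    """
    x = x_T
    for t, prompt in zip(reversed(range(len(schedule))), schedule):
        x = denoise_step(x, t, prompt)
    return x
```

Shifting steps between stages (say, from 30/20 to 40/10) is the knob the abstract refers to: it trades prominence between the two pieces of imagery without any retraining.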
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Taihao LI designed the research. Wang QI and Huanghuang DENG developed the methodology, collected the data, and worked on the software. Wang QI drafted the paper. Huanghuang DENG helped organize the paper. All the authors revised and finalized the paper.
Ethics declarations
Wang QI, Huanghuang DENG, and Taihao LI declare that they have no conflict of interest.
Cite this article
Qi, W., Deng, H. & Li, T. Multistage guidance on the diffusion model inspired by human artists’ creative thinking. Front Inform Technol Electron Eng 25, 170–178 (2024). https://doi.org/10.1631/FITEE.2300313