1 Introduction

Image generation is one of the most important research topics in artificial intelligence; it aims to produce realistic image content through algorithms and models. As shown in Fig. 1, this task ranges from simple image restoration (Bao et al. 2024; Fei et al. 2023; et al. 2023) and style transfer (Huang et al. 2024; Li et al. 2024; Wang et al. 2023) to complex scene generation (Liu and Liu 2024; Scene generation with hierarchical latent diffusion models 2023; Yang et al. 2023) and human generation (Ju et al. 2023; Wang et al. 2024; Liu et al. 2024). The advent of sophisticated technologies, including Variational Autoencoders (VAEs) (Kingma et al. 2014), Normalizing Flows (Papamakarios et al. 2021), Generative Adversarial Networks (GANs) (Goceri 2024), and wavelet-based augmentation methods (Goceri 2023), has enabled the creation of unprecedented and imaginative visual effects in image generation.

Fig. 1 Images generated by diffusion models in different visual tasks. The first row shows coarse RGB, depth, normal, and high-resolution images generated conditioned on text and a skeleton (Liu et al. 2024). The second row shows the generation results of several image restoration subtasks (Fei et al. 2023). The third row shows examples of 256 \(\times\) 256 images generated by layout-to-image methods on COCO-Stuff and Visual Genome (Yang et al. 2023). The last row shows the style transfer results generated by StyleDiffusion (Wang et al. 2023)

Diffusion models, originally inspired by the process of diffusion in physics, have been adapted to describe the stochastic process of gradually transforming simple noise distributions into complex data distributions, such as images (Neal 2001; Jarzynski 1997). Diffusion models offer a distinctive approach to image generation, whereby the data distribution is conceptualised as a sequence of conditional distributions (Ho et al. 2020; Song and Ermon 2020; Song et al. 2021). In contrast to traditional generative models, such as GANs and VAEs, which sample directly from a learned latent space, diffusion models capture intricate data dependencies through the diffusion process (Yang et al. 2024; Luo 2022, 2023). Despite their considerable successes, diffusion models remain constrained by certain limitations, including prolonged training periods, elevated computational requirements, and difficulties in scaling to high-resolution imagery (Cao et al. 2024; Yang et al. 2024; Chang et al. 2023).

Diffusion models have recently become a prominent area of interest within artificial intelligence owing to their remarkable ability to generate images of exceptional quality, often rivalling those created by humans (Fan et al. 2023; Zhang et al. 2023; et al. 2024; Yang et al. 2024; Croitoru et al. 2023a; Lin et al. 2024). They have been applied to many areas of image generation, including Style Transfer (Zhang et al. 2023; Wang et al. 2023, 2024), Image Restoration (Luo et al. 2023; Qiu et al. 2023; Ren et al. 2023), Image Editing (Pang et al. 2023; Wang et al. 2023; Gu et al. 2024), Super-Resolution (Zhao et al. 2023; Wang et al. 2024; Gandikota et al. 2024), Text-to-image Generation (Nichol et al. 2022; Saharia et al. 2022; Structured prediction for efficient text-to-image generation 2024), and other tasks (Zeng et al. 2024; Zhao et al. 2024; Hudson et al. 2024). The application of diffusion models to image generation has profoundly affected many aspects of society: it drives content innovation in industries such as advertising, gaming, film, and television, reducing creation costs and increasing productivity by generating high-quality, photorealistic images. However, it also raises issues of copyright and originality (Wang et al. 2024; Dubinski et al. 2024; Gu et al. 2024; Zhang et al. 2023) and challenges traditional patterns of artistic creation and the art market. Meanwhile, at the social and ethical level, images generated by diffusion models may be used for improper purposes such as misleading or false propaganda, posing a potential threat to public perception and information authenticity (Hu et al. 2023; Zhu et al. 2023; Lin et al. 2023; Carlini et al. 2023; Ni et al. 2023; Seunghoo and Juhun 2024; Qu et al. 2023; Brack et al. 2023).

In light of the above, this paper focuses on recent applications of diffusion models in image generation and on the socio-ethical implications of diffusion model-based image generation. This survey addresses the following key aspects: firstly, the historical background and theoretical basis of diffusion models are reviewed in depth (Ho et al. 2020; Alexander et al. 2021; Song and Ermon 2019), providing a solid foundation for the subsequent discussion. Secondly, the practical applications of diffusion models in image generation tasks, including image inpainting (Zhang et al. 2023; Anciukevicius et al. 2023; Liu et al. 2024; Lee et al. 2024), style transfer (Huang et al. 2024; Li et al. 2024; Wang et al. 2023), and super-resolution (Luo et al. 2024; Ma et al. 2024; Gao et al. 2023; Metzger et al. 2023), are investigated. Thirdly, this study examines the social problems encountered when diffusion models are used for image generation, together with potential solutions.

The aim of this work is to synthesise the extensive body of knowledge scattered across a wide range of publications, distil the key findings and present them in a coherent manner that will facilitate future research efforts. By providing a comprehensive overview of diffusion models in image generation, this survey aims to serve as a valuable resource for both novice researchers and experienced professionals wishing to deepen their knowledge or explore new avenues of research in this dynamic and evolving field.

The paper is divided into seven sections to provide a structured exploration of diffusion models in image generation. The Introduction (Sect. 1) outlines the aims and significance of the study. Sect. 2, Related Works, places our research in the context of previous studies. Sect. 3, Background on Diffusion Models, explains the theoretical basis of diffusion models. Sect. 4, Diffusion Models in Image Generation, describes recent applications of these models. Sect. 5, Ethical and Social Implications, discusses the potential impact on society. Sect. 6, Challenges and Future Directions, identifies current obstacles and suggests avenues for future research. The Conclusion (Sect. 7) summarises the main findings and contributions of the paper.

2 Related works

In recent years, diffusion models have emerged as a promising approach to image generation. These models, inspired by the principles of non-equilibrium thermodynamics, iteratively refine noise until a coherent image emerges. Notably, studies such as those by Sohl-Dickstein et al. (2015) and Song and Ermon (2019) have demonstrated the efficacy of diffusion models in producing high-quality images. The advent of diffusion models has significantly accelerated progress in image generation, and many researchers have made substantial contributions to its various subfields. Comprehensive survey papers on diffusion-based generative models have also appeared, providing valuable synthesis and summarisation of the continuous progress in this field (Po et al. 2024; Cao et al. 2024; Zhang et al. 2023; Fan et al. 2024; et al. 2024; Yang et al. 2024; Li et al. 2023; Croitoru et al. 2023b; Moser et al. 2024; Ulhaq et al. 2022; Fan et al. 2024).

The objective of text-to-image (T2I) generation is to transform natural language text descriptions into corresponding visual images. This task requires the model to have both language understanding and visual representation in order to generate images that match the textual description. The previous survey literature (Cao et al. 2024; Zhang et al. 2023) provides a comprehensive review of T2I using the diffusion model, covering both the theoretical foundations and practical progress in the field. In addition to natural images, magnetic resonance imaging (MRI), as an important medical imaging modality, also offers unique application opportunities for diffusion models (Fan et al. 2024). Other survey literature (Cao et al. 2024; Yang et al. 2024) aims to provide a comprehensive and in-depth understanding of diffusion models, from basic formulae and algorithmic improvements to a variety of applications, revealing their development history and future trends.

In the field of computer vision, there is a substantial body of survey literature (Ulhaq et al. 2022; Croitoru et al. 2023b) that provides a comprehensive review of denoising diffusion models. This includes theoretical and practical contributions to the field. Additionally, there is a significant corpus of survey literature that addresses more specific visual subfields, such as image restoration and enhancement (Li et al. 2023), video generation ( et al. 2024), and image super-resolution (Moser et al. 2024).

The field of diffusion models is developing rapidly, with a considerable amount of new research emerging in image generation, so a comprehensive literature survey on the latest applications of diffusion models to image generation is of great importance. However, most existing surveys are limited in two ways. Firstly, owing to their publication dates, they do not cover the latest advances in diffusion-based image generation (Ulhaq et al. 2022; Li et al. 2023). Secondly, they rarely consider the potential social impact of diffusion models in image generation (Croitoru et al. 2023b; Moser et al. 2024). Therefore, this study aims to provide insight into the development of diffusion models for image generation by surveying a large body of image generation literature and related work on social ethics.

In order to guarantee that the research is both pioneering and pertinent, it is essential to document the most recent developments and discourses within the field. Consequently, our work has concentrated primarily on diffusion model methodologies for image generation over the past three years. In addition, we have investigated the ethical implications and potential mitigating strategies associated with these approaches. The selected papers not only demonstrate significant technological advances, but also address the broader impact of these technologies on society, thereby ensuring a comprehensive analysis of the technical and ethical dimensions of diffusion models in image generation.

3 Background on diffusion models

The core idea of diffusion models comes from sequential Monte Carlo (Neal 2001) and non-equilibrium statistical physics (Jarzynski 1997), which use a Markov chain to transform one distribution into another. A diffusion model has two main components: the forward diffusion process and the backward denoising process. The forward process gradually adds noise to the original image and eventually converts it into pure noise following a Gaussian distribution. The backward process does exactly the opposite, converting a pure noise image into a realistic image from the original distribution through several denoising steps. To further convey the intuition behind diffusion models, we discuss the three main formulations currently studied: denoising diffusion probabilistic models (Ho et al. 2020; Alexander et al. 2021), score-based generative models (Song and Ermon 2019, 2020), and stochastic differential equations (Song et al. 2020, 2021). The following three subsections elaborate on each formulation and discuss their connections and differences.

3.1 Denoising diffusion probabilistic models (DDPMs)

Suppose we sample the initial data \(x_0\sim q(x)\) from a real data distribution q(x). By using the forward diffusion process to gradually add noise to the initial data \(x_0\), a series of noisy data \(x_1,..., x_T\) is obtained. According to the properties of Markov processes and the chain rule of probability, the joint distribution of all noisy data can be written as \(q(x_1,..., x_T|x_0)\),

$$\begin{aligned} q(x_1,..., x_T|x_0)=\prod _{t=1}^{T}q(x_t|x_{t-1}). \end{aligned}$$
(1)
Fig. 2 The directed graphical model of DDPM (Ho et al. 2020)

The transition kernel \(q(x_t|x_{t-1})\) is designed manually in DDPM (Ho et al. 2020), and noise is gradually added to the data at each transition step. A hyperparameter schedule \(\beta _t\) controls the variance of the noise introduced at each step.

$$\begin{aligned} q(x_t|x_{t-1})=\mathcal {N}(x_t; \sqrt{1-\beta _t}x_{t-1}, \beta _t I), \end{aligned}$$
(2)

where \(\beta _t \in (0,1)\).

By leveraging the properties of the Gaussian distribution and Eq.(2), we can streamline the calculation of the forward diffusion process, thereby directly obtaining the analytical form of \(q (x_t | x_0)\),

$$\begin{aligned} \begin{aligned} \alpha _t=1-\beta _t, \quad \bar{\alpha }_t=\prod _{i=1}^{t}\alpha _i \\ q(x_t|x_0)=\mathcal {N}(x_t; \sqrt{\bar{\alpha }_t}x_0, (1-\bar{\alpha }_t)I) \end{aligned}. \end{aligned}$$
(3)

As t approaches T (with T large) and \(\bar{\alpha }_t\) approaches 0, we have \(x_T\sim \mathcal {N}(0, I)\). Given the initial data \(x_0\) and a sampled Gaussian vector \(\epsilon\), it is straightforward to use Eq.(3) to compute the sample \(x_t\),

$$\begin{aligned} x_t=\sqrt{\bar{\alpha }_t}x_0+\sqrt{1-\bar{\alpha }_t}\epsilon . \end{aligned}$$
(4)
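
The closed-form expression in Eq. (4) makes the forward process trivial to implement. The following is a minimal PyTorch-style sketch of this noising step, assuming the linear \(\beta _t\) schedule used in DDPM; the function and variable names (q_sample, alpha_bar) are illustrative rather than taken from any particular codebase.

```python
import torch

# Example noise schedule: linear beta_t, as used in DDPM (Ho et al. 2020)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1, ..., beta_T
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_i (1 - beta_i)

def q_sample(x0, t, alpha_bar):
    """Draw x_t ~ q(x_t | x_0) in closed form (Eqs. 3-4)."""
    noise = torch.randn_like(x0)                              # epsilon ~ N(0, I)
    a_bar = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast over the batch
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```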

A new sample can be generated from the data distribution by starting from \(x_T\sim \mathcal {N}(0, I)\) and applying the reverse denoising process. Conditioned on \(x_0\), the posterior of a single reverse step has the following closed form:

$$\begin{aligned} \begin{aligned} q(x_{t-1}|x_t, x_0)=\frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)}\\ =\frac{\mathcal {N}(x_t; \sqrt{\alpha }_t x_{t-1}, (1-\alpha _t)I)\mathcal {N}(x_{t-1}; \sqrt{\bar{\alpha }}_{t-1}x_0, (1-\bar{\alpha }_{t-1})I)}{\mathcal {N}(x_t;\sqrt{\bar{\alpha }}_t x_0, (1-\bar{\alpha }_t)I)}\\ =\mathcal {N}(x_{t-1}; \mu _q(x_t,x_0), \Sigma _q(t)I). \end{aligned} \end{aligned}$$
(5)

It can be observed that the \(x_{t-1}\) obtained from each reverse step follows a normal distribution, with mean \(\mu _q(x_t,x_0)\), a function of \(x_t\) and \(x_0\), and variance \(\Sigma _q(t)\), a function of the coefficients \(\alpha\), which are fixed and known at each step. Therefore, learning the reverse transition kernel amounts to fitting this distribution, i.e. predicting the mean \(\mu _q(x_t,x_0)\) and variance \(\Sigma _q(t)\),

$$\begin{aligned} p_{\theta }(x_{t-1}|x_t)=\mathcal {N}(x_{t-1}; \mu _{\theta }(x_t, t), \Sigma _q(t)I), \end{aligned}$$
(6)

where \(\theta\) is a learnable neural network parameter.

During training, the approximate denoising transition \(p_{\theta }(x_{t-1}|x_t)\) should match the true denoising transition \(q(x_{t-1}|x_t,x_0)\) as closely as possible. From Eq.(5) and Eq.(6), their variance terms are fixed and can be matched exactly, so when using the KL-divergence to measure the difference between the two distributions, only their means need to be considered. Because \(p_{\theta }(x_0)\) is intractable, training instead minimises a variational bound on the negative log-likelihood. For matching the entire trajectory, the objective function can be formulated as follows:

$$\begin{aligned} L_{vlb}=L_0+\sum _{1<t\le T}L_t + L_T, \end{aligned}$$
(7)

where \(L_0=-\log {p_{\theta }(x_0|x_1)}\), \(L_T=KL(q(x_T|x_0)||\pi (x_T))\), and \(L_t=KL(q(x_{t-1}|x_t,x_0)||p_{\theta }(x_{t-1}|x_t))\). \(KL(\cdot ||\cdot )\) denotes the KL-divergence between two distributions. Furthermore, since \(q(x_{t-1}|x_t,x_0)\) follows a normal distribution, the KL-divergence can be computed in closed form.

As proposed in DDPM (Ho et al. 2020), \(L_{vlb}\) can be optimised by minimizing the deviation between the true noise and the noise predicted by the model at random time steps along the trajectory:

$$\begin{aligned} L_s=\mathbb {E}_{t\sim [1,T],x_0 \sim q(x_0), \epsilon _t \sim \mathcal {N}(0,I)}[||\epsilon _t - \epsilon _{\theta }(x_t, t)||^2]. \end{aligned}$$
(8)

where \(\mathbb {E}\) denotes expectation, \(\epsilon _{\theta }(x_t, t)\) is the noise predicted by the neural network at step t, and \(x_0\) is the initial data.
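
The simplified objective in Eq. (8) reduces training to a regression on the injected noise. A minimal sketch of one training step is shown below, reusing the hypothetical q_sample helper and alpha_bar schedule from the earlier snippet; eps_net stands for any noise-prediction network taking \((x_t, t)\) as input.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_net, x0, alpha_bar):
    """Simplified DDPM objective L_s (Eq. 8): predict the noise added to x_0."""
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],), device=x0.device)
    x_t, noise = q_sample(x0, t, alpha_bar)     # forward process, as sketched above
    return F.mse_loss(eps_net(x_t, t), noise)   # ||epsilon - epsilon_theta(x_t, t)||^2
```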

3.2 Score-based generative models (SGMs)

The objective of an explicit generative model is to model the probability distribution of the data and then generate samples from this distribution. Score-based models (Song and Ermon 2019) do not directly learn the probability distribution p(x) of the data, but rather learn the score function, i.e. the gradient of the log probability density. It is defined as follows:

$$\begin{aligned} \nabla _x \log {p(x)}=\frac{\partial \log {p(x)}}{\partial x}. \end{aligned}$$
(9)

The score function learned by Score-Based Models can be denoted as \(s_{\theta }(x)\), and by learning to approximate the logarithmic gradient of the probability distribution, \(s_{\theta }(x)\approx \nabla _x \log {p(x)}\). Furthermore, \(s_{\theta }(x)\) can be parameterized using an energy-based model,

$$\begin{aligned} \begin{aligned} p_{\theta }(x)=\frac{e^{-f_{\theta }(x)}}{Z_{\theta }}\\ s_{\theta }(x)=\nabla _x \log {p_{\theta }(x)}=-\nabla _x f_{\theta }(x)-\nabla _x \log {Z_{\theta }}, \end{aligned} \end{aligned}$$
(10)

where \(Z_{\theta }=\int e^{-f_{\theta }(x)} \textrm{d}x\) is the normalizing constant, which ensures that \(p_{\theta }(x)\) is a density function. Since \(Z_{\theta }\) is a constant, \(\nabla _x \log {Z_{\theta }}\) equals zero. Therefore, score-based models do not depend on the normalization term, which greatly expands the range of models and architectures that can be used.

Similar to other explicit generative models, Score-Based Models can calculate the Fisher divergence between the predicted distribution of the model and the ground-truth data distribution to measure the difference between the two distributions. Then, the model is trained by minimizing Fisher divergence. Therefore, the objective function can be defined as:

$$\begin{aligned} \mathbb {E}_{p(x)}[||\nabla _x \log {p(x)} - s_{\theta }(x)||_2^2] \end{aligned}$$
(11)

However, it is difficult to calculate the Fisher divergence directly because the score of the real data distribution is unknown. Fortunately, the score matching method (Hyvarinen and Dayan 2005; Vincent 2011; Song et al. 2020) can be used to bypass the calculation of p(x) and minimize the divergence. The optimization objective can be rewritten as

$$\begin{aligned} \mathbb {E}_{p(x)}[2tr(\nabla _x s_{\theta }(x))+ ||s_{\theta }(x)||_2^2] \end{aligned}$$
(12)
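
Since the trace term in Eq. (12) involves the Jacobian of the score network, it is usually estimated stochastically in practice. Below is a hedged sketch of such an estimator using Hutchinson-style random projections (as in sliced score matching); score_net and the number of projections are illustrative assumptions rather than part of the original formulation.

```python
import torch

def score_matching_loss(score_net, x, n_projections=1):
    """Monte-Carlo estimate of E[2 tr(grad_x s_theta(x)) + ||s_theta(x)||^2] (Eq. 12)."""
    x = x.clone().requires_grad_(True)
    s = score_net(x)                                   # s_theta(x), same shape as x
    norm_term = s.flatten(1).pow(2).sum(dim=1)
    trace_est = torch.zeros_like(norm_term)
    for _ in range(n_projections):
        v = torch.randn_like(x)                        # random probe vector
        # grad of (v . s) dotted with v gives v^T J v, an unbiased estimate of tr(J)
        grad_vs = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
        trace_est = trace_est + (grad_vs * v).flatten(1).sum(dim=1)
    return (2.0 * trace_est / n_projections + norm_term).mean()
```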

Training the model with the above objective makes \(s_{\theta }(x)\approx \nabla _x \log {p(x)}\), yielding a trained score-based model. Langevin dynamics (Parisi 1981; Ulf and Miller 1994) can then be used to iteratively generate samples from the predicted score model. The sampling process of Langevin dynamics can be expressed as the following iteration:

$$\begin{aligned} x_{i+1}\leftarrow x_i + \eta \nabla _x \log {p(x)} + \sqrt{2\eta }z_i, i=0,1,...,K, \end{aligned}$$
(13)

where \(z_i \sim \mathcal {N}(0, I)\) and \(\eta\) is a sufficiently small step size. As K approaches infinity, \(x_K\) converges to a sample from the data distribution p(x). Moreover, the unknown quantity \(\nabla _x \log {p(x)}\) in the iteration is approximated by the prediction of the trained model \(s_{\theta }(x)\).
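
As a concrete illustration of Eq. (13), the following sketch runs unadjusted Langevin dynamics with the learned score network standing in for \(\nabla _x \log {p(x)}\); the initialisation, step size, and number of steps are illustrative choices rather than values prescribed by the cited papers.

```python
import torch

@torch.no_grad()
def langevin_sample(score_net, shape, eta=1e-4, n_steps=1000, device="cpu"):
    """Iterate Eq. (13) with the trained score model s_theta(x) in place of grad log p(x)."""
    x = torch.randn(shape, device=device)        # arbitrary starting point x_0
    for _ in range(n_steps):
        z = torch.randn_like(x)                  # z_i ~ N(0, I)
        x = x + eta * score_net(x) + (2.0 * eta) ** 0.5 * z
    return x
```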

We have discussed the training and sampling of score-based models. In practice, however, their performance can be unsatisfactory, mainly because little data is available for score matching in low-density regions, which leads to inaccurate score estimates there and limits the model's ability to generate high-quality samples. Noise Conditional Score Networks (NCSNs) (Song and Ermon 2020; Song et al. 2020) perturb the data points with multi-scale noise to fill low-density regions, thereby improving the accuracy of the estimated scores. The distribution after noise perturbation can be expressed as

$$\begin{aligned} p_{\sigma _i}(x)=\int p(y)\mathcal {N}(x;y,\sigma _i^2 I)\textrm{d}y, \end{aligned}$$
(14)

where \(\sigma _1< \sigma _2<...<\sigma _L\) are a set of increasing standard deviations. As a result, the goal of the score-based model becomes to compute the score function \(s_{\theta }(x, i)\) of each noise-perturbed distribution, and score matching can be used to train the parameterized NCSN. The overall optimization objective is a weighted Fisher divergence:

$$\begin{aligned} \sum _{i=1}^L \lambda (i) \mathbb {E}_{p_{\sigma _i}(x)} [||\nabla _x \log {p_{\sigma _i}(x)} - s_{\theta }(x,i)||_2^2], \end{aligned}$$
(15)

where \(\lambda (i)=\sigma _i^2\) is a weighting term. The training procedure is the same as for the unperturbed \(s_{\theta }(x)\). After training the model \(s_{\theta }(x,i)\), samples are generated iteratively with annealed Langevin dynamics (Song and Ermon 2019, 2020; Jolicoeur-Martineau et al. 2020), with \(i=L, L-1,..., 1\).

3.3 Stochastic differential equations (SDEs)

As the number of noise scales and time steps approaches infinity, the noise perturbation process of score-based models and the noise addition (forward diffusion) process of diffusion models can both be generalised to continuous-time stochastic processes. Many stochastic processes can be represented as solutions of stochastic differential equations (SDEs), which take the following form:

$$\begin{aligned} \textrm{d}x = f(x,t)\textrm{d}t + g(t)\textrm{d}w, \end{aligned}$$
(16)

where \(f(\centerdot ,t): \mathcal {R}^d \rightarrow \mathcal {R}^d\) and \(g(t)\in \mathcal {R}\) are a vector-valued function and a real-valued function, respectively, \(\textrm{d}w\) is an infinitesimal Gaussian white noise, and w is a standard Brownian motion. Anderson (1982) shows that every SDE has a corresponding reverse-time SDE, which has the following closed form:

$$\begin{aligned} \textrm{d}x=[f(x,t)-g^2 (t)\nabla _x \log {p_t(x)}]\textrm{d}t + g(t)\textrm{d}w, \end{aligned}$$
(17)

where \(\textrm{d}t\) is a negative infinitesimal time step. To solve the reverse SDE, we need to estimate the score function of \(p_t(x)\).

To train \(s_{\theta }(x, t)\), the Fisher divergence in continuous time can be calculated as follows:

$$\begin{aligned} \mathbb {E}_{t\in \mathcal {U}(0,T)}\mathbb {E}_{p_t(x)}[\lambda (t)||\nabla _x \log {p_t(x)} - s_{\theta }(x,t)||_2^2] \end{aligned}$$
(18)

where \(\mathcal {U}(0,T)\) denotes the uniform distribution over [0, T], and \(\lambda (t)\) is a weighting function, as in score-based models.

Specifically, DDPMs and SGMs correspond to two special SDEs that differ in how noise is added during the forward process. As proposed by Song et al. (2020), DDPMs correspond to the Variance Preserving SDE (VP SDE), which has the following form:

$$\begin{aligned} \textrm{d}x = -\frac{1}{2}\beta (t)x \textrm{d}t + \sqrt{\beta (t)}\textrm{d}w, \end{aligned}$$
(19)

where \(\beta (t)\) is a predefined schedule function. SGMs correspond to the Variance Exploding SDE (VE SDE), which has the following form:

$$\begin{aligned} \textrm{d}x = \sqrt{\frac{\textrm{d}[\sigma ^2(t)]}{\textrm{d}t}}\textrm{d}w, \end{aligned}$$
(20)

where \(\sigma (t)\) is the scale of the perturbation noise at time t, which grows as t increases. Both SGMs and DDPMs can thus be regarded as discretizations of stochastic differential equations determined by their score functions, so score-based generative models and diffusion probabilistic models can be summarized within a unified SDE framework.
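
To make the SDE view concrete, the sketch below discretises the reverse-time VP SDE (Eqs. 17 and 19) with a simple Euler-Maruyama scheme; the score network, the callable \(\beta (t)\) schedule, and the step count are assumptions made for illustration, not a prescribed solver from the cited works.

```python
import torch

@torch.no_grad()
def reverse_vp_sde_sample(score_net, shape, beta, n_steps=1000, device="cpu"):
    """Euler-Maruyama integration of the reverse VP SDE, from t = 1 down to t = 0.

    beta: callable t -> beta(t); score_net(x, t) approximates grad_x log p_t(x).
    """
    x = torch.randn(shape, device=device)                  # sample from the prior
    dt = -1.0 / n_steps                                    # negative time increment
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        b = beta(t)
        # drift of Eq. (17) with f(x, t) = -0.5 * beta(t) * x and g(t)^2 = beta(t)
        drift = -0.5 * b * x - b * score_net(x, t)
        x = x + drift * dt + (b * abs(dt)) ** 0.5 * torch.randn_like(x)
    return x
```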

3.4 Controllable generation for diffusion models

The generative models discussed so far aim to fit the data distribution p(x) and generate samples from the same distribution as the initial data; they can be summarized as unconditional generative models. In practical applications, however, we often want to generate samples with specified characteristics, and the corresponding models are called conditionally controlled diffusion models. The fitting target then becomes the conditional data distribution p(x|y). According to Bayes' theorem, it can be expressed as

$$\begin{aligned} p(x|y) = \frac{p(x)p(y|x)}{p(y)} = \frac{p(x)p(y|x)}{\int p(x)p(y|x) \textrm{d}x}. \end{aligned}$$
(21)

Similar to score-based models, by taking the logarithm of both sides of this expression and then the gradient with respect to x, we obtain the score function of the conditional data distribution

$$\begin{aligned} \nabla _x \log {p(x|y)} = \nabla _x \log {p(x)} + \nabla _x \log {p(y|x)}. \end{aligned}$$
(22)

From the above expression, the score function of the conditional data distribution consists of the score function of the unconditional data distribution plus the gradient of the log-likelihood \(\nabla _x \log {p(y|x)}\). Conditionally controlled diffusion models can accordingly be divided into classifier guidance and classifier-free guidance, depending on whether an external classifier is used alongside an existing unconditional generation model.

For Classifier Guidance (Dhariwal and Nichol 2021; Liu et al. 2023), the diffusion model does not need to be retrained; simple control can be achieved at low cost by training a classifier. In a conditional sampling process, the state transition probability can be rewritten as (Dhariwal and Nichol 2021):

$$\begin{aligned} p_{\theta , \phi }(x_{t}|x_{t+1}, y) = Zp_{\theta }(x_{t}|x_{t+1})p_{\phi }(y|x_{t}), \end{aligned}$$
(23)

where Z denotes the normalization term, \(\theta\) is the diffusion model parameter, and \(\phi\) is the classifier parameter.

By taking the logarithm of both sides of the above equation and expanding it, we can obtain

$$\begin{aligned} \begin{aligned} \log {p_{\theta , \phi }(x_{t}|x_{t+1}, y)} = \log {p_{\theta }(x_{t}|x_{t+1})} + \log {p_{\phi }(y|x_{t})} + C \\ = -\frac{1}{2}(x_t - \mu - \Sigma g)^{\top }\Sigma ^{-1}(x_t - \mu - \Sigma g) + \frac{1}{2}g^{\top }\Sigma g + C \end{aligned} \end{aligned}$$
(24)

where \(g=\nabla _{x_t}\log {p_{\phi }(y|x_t)}|_{x_t=\mu }\). It can be seen that compared to the unconditional transition distribution, the conditional transition distribution also follows a Gaussian distribution, and the variance \(\Sigma\) is the same as the unconditional transition distribution \(p_{\theta }(x_{t}|x_{t+1})\). At the same time, the mean has a shift of \(\Sigma g\), which also includes gradient information from the classifier.

By using a hyperparameter s to control the degree of classifier guidance, the sampling algorithm with classifier guidance is represented as

$$\begin{aligned} x_{t-1} \sim \mathcal {N}(\mu + s\Sigma \nabla _{x_t}\log {p_{\phi }(y|x_t)}, \Sigma ), \end{aligned}$$
(25)

where \(\mu = \mu _{\theta }(x_t), \Sigma = \Sigma _{\theta }(x_t)\).
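The guided reverse step of Eq. (25) simply shifts the predicted Gaussian mean by the scaled classifier gradient. A hedged sketch follows; the classifier interface, the diagonal form of \(\Sigma\), and the tensor shapes are assumptions made for illustration.

```python
import torch

def classifier_guided_step(mu, sigma, classifier, x_t, y, s=1.0):
    """One reverse step of Eq. (25): x_{t-1} ~ N(mu + s * Sigma * grad, Sigma).

    mu, sigma: mean and (diagonal) variance predicted by the diffusion model at step t.
    classifier: noise-aware classifier returning logits for x_t; y: target class indices.
    """
    x_in = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_in).log_softmax(dim=-1)
    selected = log_probs[torch.arange(x_in.shape[0], device=x_in.device), y]
    grad = torch.autograd.grad(selected.sum(), x_in)[0]   # grad_{x_t} log p_phi(y | x_t)
    mean = mu + s * sigma * grad                          # shifted mean
    return mean + sigma.sqrt() * torch.randn_like(mu)     # sample from the guided Gaussian
```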

In the guided diffusion model, the diffusion model itself is not retrained; instead, the classifier's guidance information is injected during sampling. However, this approach requires training a separate noise-aware classifier, making the overall pipeline cumbersome, and it cannot fully exploit the performance of the diffusion model. Jonathan and Salimans (2022) improved DDPM and proposed a classifier-free guidance technique, also known as Classifier-Free Guidance (CFG). Its core idea is simple: the conditional input is fed into the diffusion model during training and fitted directly by the model, so no classifier is needed. This is fundamentally different from classifier guidance, which does not touch the training of the diffusion model and only guides the sampling process.

In the training process of the CFG model, the conditional input y is introduced, and the noise prediction network takes \(x_t\), t, and y as inputs. Therefore, the loss function becomes:

$$\begin{aligned} L_s(\theta )=\mathbb {E}_{t\sim [1,T],x_0 \sim q(x_0), \epsilon \sim \mathcal {N}(0,I)}[||\epsilon - \epsilon _{\theta }(x_t, t, y)||^2]. \end{aligned}$$
(26)
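
At sampling time, classifier-free guidance combines a conditional and an unconditional noise prediction from the same network (the unconditional branch uses a learned null condition that replaces y with some probability during training). The snippet below is a hedged sketch of that combination with a guidance scale w; the null_token mechanism and the value of w are illustrative assumptions.

```python
import torch

def cfg_noise(eps_net, x_t, t, y, null_token, w=3.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by a guidance scale w."""
    eps_uncond = eps_net(x_t, t, null_token)   # condition dropped (null token)
    eps_cond = eps_net(x_t, t, y)              # condition y provided
    return eps_uncond + w * (eps_cond - eps_uncond)
```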

3.5 Improvements of diffusion models

Compared with VAEs (Higgins et al. 2017; Kingma et al. 2014; van et al. 2017) and GANs (Goodfellow et al. 2014; Arjovsky et al. 2017; Reed et al. 2016; Zhang et al. 2017), diffusion models require considerably more time for sampling and may need thousands of network evaluations to draw a single sample. This is because iteratively transforming the prior distribution into a complex data distribution with an SDE or Markov process involves a large number of function evaluations in the reverse process. In addition, diffusion models face instability in the reverse process, as well as the computational demands and constraints of training in high-dimensional Euclidean space. Their log-likelihoods are also not yet competitive with those of likelihood-based models, which is another challenge.

Researchers have proposed various methods to address these challenges. To improve sampling efficiency, many advanced SDE and ODE solvers have been applied to diffusion models (Lu et al. 2022; Zhang et al. 2023; et al. 2023; Huang et al. 2024; Zheng et al. 2023). Diffusion distillation can also be used, which distils a trained teacher model into a student that requires far fewer sampling steps (Salimans et al. 2022; Li et al. 2023; Wizadwongsa et al. 2023; Wenliang et al. 2023). In summary, these efficient sampling methods fall into two main categories, learning-free sampling and learning-based sampling, depending on whether an additional learning process is required after the diffusion model has been trained (Yang et al. 2024; Luo 2023). In addition, new forward processes can be designed to improve sampling stability and reduce dimensionality (Yilun et al. 2023; Vahdat et al. 2021; Rombach et al. 2022).

When training a diffusion model, the (negative) variational lower bound (VLB) on the log-likelihood is used as the objective, which may not be tight in many cases and can lead to suboptimal log-likelihood (Kingma et al. 2021; Maggiora et al. 2024). Zheng et al. (2023) propose an improved maximum likelihood estimation technique for diffusion ODEs from both the training and evaluation perspectives. In addition, many methods have been proposed to further improve the VLB and log-likelihood from different angles, including noise schedule optimization (Kingma et al. 2021; Nichol et al. 2022), reverse variance learning (Bao et al. 2022), and exact likelihood computation (Song et al. 2020; Cheng et al. 2022). Because the exact log-likelihood is intractable and its optimization is therefore often neglected in diffusion models, MLE training (Song et al. 2021; Kingma et al. 2021; Huang et al. 2021) and hybrid losses (Nichol et al. 2022; Cheng et al. 2022) have been proposed to improve likelihood training.

Almost all diffusion models use a convolutional U-Net (Ronneberger et al. 2015) as their backbone, but Peebles and Xie (2023) and Ma et al. (2024) introduced Transformers into diffusion models, further enhancing their generative capabilities and leading to the recent high-performance video generation model Sora (Liu et al. 2024). Diffusion models assume that the data lives in Euclidean space and thus initially handled only continuous data such as images. To improve performance on discrete or other data types, feature space unification (Vahdat et al. 2021) and data-dependent transition kernels (Austin et al. 2021) have been proposed to extend the application scope of diffusion models.

Table 1 Summary of the application of diffusion models in the field of image generation

4 Diffusion models in image generation

Fig. 3 Stylized images generated by Puff-Net (Zheng et al. 2024)

4.1 Style transfer

The task of image style transfer, which transforms images of one style (the source style) into another style (the target style), has received widespread attention in the research community (Gatys et al. 2015; Johnson et al. 2016; Dumoulin et al. 2017; Xun et al. 2017), as shown in Fig. 3. This transformation can be achieved by training a model to learn the image features of the source and target styles and then generating new images based on these features. Recently, diffusion models have been introduced into this area, achieving better performance and high-fidelity stylized image generation (Wang et al. 2023; Yang et al. 2023; Wang et al. 2023; Qi et al. 2024).

To solve the problem of preserving image content in the diffusion model, Huang et al. (2022) propose a text-driven image stylization framework based on dual diffusion to control the balance between content and style. This method integrates the multimodal style information as a guide into the step-by-step diffusion process, and performs the reverse denoising process on this basis, so that the styled results can better retain the structural information of the content image.

Brack et al. (2022) propose the Stable Artist, an iterative approach to guiding the generated images to the desired output. It achieves control by allowing the artist to guide the diffusion process along a variable number of semantic directions. This semantic guidance (SEGA) provides fine-grained control over the image generation process by exploiting complex operations in the underlying space of the model, allowing for tiny edits to the image, changes in composition and style, and optimization of the overall artistic concept.

Deng et al. (2023) propose Z-STAR, a zero-shot (i.e. training-free) image style transfer method based on attention rearrangement. It exploits generative prior knowledge to transfer image styles without retraining or adaptation, and better addresses the problem that text prompts are too coarse to express the desired style details.

Chung et al. (2023) propose a style transfer approach that uses large-scale pre-trained diffusion models and simply manipulates the features of self-attention, replacing the keys and values of the content with those of the style, without optimization or supervision (such as text). This avoids the inference-time optimization (such as fine-tuning or style text inversion) required by existing diffusion-based style transfer methods and better exploits the generative ability of large-scale diffusion models.

Zhang et al. (2023) propose InST, an inversion-based style transfer method that effectively and accurately learns the key information of images to capture and transfer the style of painting art. It addresses the problems that specific artistic elements are difficult to transfer, that textual prompts tend to misdescribe the target style, and that the key ideas of specific paintings are hard to reproduce in the result.

To address the high cost of fine-tuning the diffusion model or an additional neural network, Yang et al. (2023) propose a zero-shot contrastive (ZeCon) loss for diffusion models that transfers the style of a given image while preserving its semantic content, without additional fine-tuning or an auxiliary network. In addition to preserving content, the method also enables texture modification.

Wang et al. (2023) propose a new C-S disentanglement framework for style transfer. This framework explicitly extracts content information and implicitly learns complementary style information, achieving interpretable and controllable C-S disentanglement and high-quality stylized results. They further introduce the diffusion model into the framework, enabling it to achieve SOTA results.

Pan (2023) proposes an innovative style guidance approach that improves existing text-to-image diffusion models while also supporting reference images to guide arbitrary styles of generated images. The method optimizes the style guidance function to reduce the influence of noisy input and maximize guidance efficiency. As a result, both supervised style guidance and self-style guidance generate images in the desired style while maintaining a high correlation between the generated images and the text input.

Chen et al. (2023) propose a model called ArtFusion, which aims to provide a flexible balance between content and style for AST. The model utilizes a dual-condition latent diffusion probabilistic model, which removes the need for paired data in cDM training and promotes progress on other multi-condition generation tasks. During training, the model casts the style transfer task as a self-reconstruction task while maintaining robust stylization ability at inference.

Wang et al. (2024) propose a novel AST method called Highly Controllable Arbitrary Style Transfer (HiAST) to address the demand for flexible and customized stylized results. This model introduces a Style Adapter that allows users to flexibly manipulate the output stylized results by aligning multi-level style information and intrinsic knowledge in LDM.

Fig. 4 Visual images of a variety of synthetic and real-world restoration tasks (Luo et al. 2023)

4.2 Image restoration

Image Restoration (IR) is a long-standing problem due to its broad applicability and ill-posed nature. The goal of IR is to recover a high-quality (HQ) image from its low-quality (LQ) counterpart, which has been degraded by various factors (e.g., blur, masking, downsampling), as shown in Fig. 4. Representative diffusion-based image restoration methods are described below:

Lugmayr et al. (2022) propose an inpainting method based on the denoising diffusion probabilistic model (DDPM) that does not require mask-specific training. The reverse diffusion iterations are modified by sampling the unmasked regions using the given image information. Rather than learning a mask-conditioned generative model, the model conditions the generation process on the given pixels during the reverse diffusion iterations and is not trained for the inpainting task itself.

Luo et al. (2023) propose to use a latent diffusion model to achieve realistic image recovery at large scale. A U-net-based latent diffusion strategy is proposed, which allows image restoration to be performed in a compressed and low-resolution latent space, thus speeding up training and inference.

Lin et al. (2023) propose DiffBIR, which applies a pre-trained text-to-image diffusion model to the problem of blind image restoration. This method outperforms state-of-the-art approaches in blind image super-resolution and face restoration tasks on both synthetic and real-world datasets. Furthermore, DiffBIR can effectively handle severe degradation and restore both realistic and vivid semantic content.

Qiu et al. (2023) propose a bootstrap diffusion model, DiffBFR, for blind face restoration. DiffBFR effectively employs the diffusion model to solve the blind face restoration problem; it not only reduces the training difficulty and training time of the whole model but also provides a truncated sampling module that takes less degraded inputs under severe conditions. Moreover, this method outperforms GANs in avoiding training collapse and modelling long-tailed distributions.

Ren et al. (2023) introduce a simple and effective multiscale structure guide as an implicit bias to inform the icDPM about the coarse structure of sharp images in the middle layer. This guide leads to significant improvements in deblurring results, especially in invisible regions. The model can recover clean images more accurately and effectively.

Wang et al. (2023) propose a Coarse-to-Fine Diffusion Transformer (C2F-DFT), built from a diffusion transformer block containing Diffusion Self-Attention (DFSA) and a Diffusion Feed-forward Network (DFN). C2F-DFT embeds diffusion into transformers, allowing the model not only to capture long-range dependencies but also to take full advantage of the generative power of diffusion models for better image restoration.

Liu et al. (2023) propose a residual denoising diffusion model (RDDM). It decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. Residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process of image recovery, while noise diffusion represents random disturbances in the diffusion process. RDDM can solve the uninterpretability problem of a single denoising process and unify different tasks that require different levels of certainty or diversity.

Chen et al. (2024) propose a hierarchical integrated diffusion model (HI-Diff) for real-world image deblurring. HI-Diff exploits the power of diffusion models to generate informative priors for better results and generalization in complex blurred scenes.

Fig. 5 The resulting images obtained by editing the animal images (Nguyen et al. 2024; Ling et al. 2024)

4.3 Image editing

As shown in Fig. 5, image editing is a technology that modifies, enhances, or synthesizes images and can be used in a variety of application scenarios, such as artistic creation, entertainment, education, and medicine. The goal of image editing is to produce images with high quality, diversity, and controllability while maintaining their naturalness and semantic correctness. In recent years, image editing methods based on generative models, especially diffusion models, have made remarkable progress.

Pang et al. (2023) propose a new initialization method called cross initialization, which can significantly reduce the gap between initial and learned embeddings, thereby improving the reconstruction quality and editability of images. It also introduces a regularization term to make the learned embedding stay close to the initial one, further improving editability.

Choi et al. (2023) propose a new editing method called Custom-Edit, which consists of two steps: (i) using a small number of reference images to customize the diffusion model, and (ii) using effective text-guided editing methods to edit the images. They found that customizing only language-related parameters and using enhanced text prompts significantly improved the similarity of the reference image while maintaining the similarity of the source image.

Li et al. (2023) propose two novel subject-driven image editing subtasks, namely subject replacement and subject addition, which enable more refined and flexible image editing. A new iterative generation method, called DreamEditor, is designed to achieve high-quality subject replacement and addition by gradually adapting to the change from the source subject to the target subject.

Wang et al. (2023) propose a general framework called MDP, which characterizes various manipulations suitable for editing in diffusion models. They identify five types of manipulation, covering intermediate latents, conditional embeddings, cross-attention maps, guidance, and predicted noise, and analyze the parameters and schedules corresponding to each. They also demonstrate a new control method that, by manipulating the predicted noise, achieves higher-quality local and global editing than previous work.

Huang et al. (2023) propose a new sampling method called KV Inversion, which enables high-quality image editing without the need for fine-tuning. This method can change the action of the object in the image according to the textual prompt while maintaining the texture and identity of the object in the image. The advantage of KV inversion is that there is no need to train the diffusion model itself, nor does it need to scan large data sets for time-consuming training, but only to use the pre-trained diffusion model and text encoder to achieve image action editing.

Lin et al. (2023) propose an image editing method based on learnable regions that can specify the editing target and content through text prompts without requiring the user to provide masks or sketches. This method uses a pre-trained text-to-image model and introduces a bounding box generator to find the editing region aligned with the text prompt. Therefore, the mask-based text-to-image model can perform local image editing without masks or other user-provided guidance.

Aiming at the problem of reconstruction failure in real image editing, Chen et al. (2023) propose three sampling methods: FEC-ref, FEC-noise, and FEC-kv-reuse for different editing types and settings. The goal of these three methods is to ensure the success of reconstruction, that is, the result of sampling can retain the texture and features of the original real image, and cooperate with multiple editing methods to improve the performance of the editing task. All three methods do not require fine-tuning of diffusion models or training on large datasets, thus saving time and computational resources.

Nguyen et al. (2023) propose a new image editing method, Visual Instruction Inversion, which can guide the text-to-image diffusion model through visual prompts. A new visual cue generator is designed to generate a short text instruction to describe the changes between images based on the given source image and target image.

Kim et al. (2023) propose a simple and effective way to make the process of writing text prompts more user-friendly by incorporating a text generation framework. Specifically, they first classify text prompts into three categories based on the level of semantic detail: simple, medium, and complex. They then use existing text generation frameworks, such as T5 (Raffel et al. 2020) and DALL-E (Ramesh et al. 2021), to generate medium or complex text prompts based on the target words entered by the user.

Dong et al. (2023) propose a method for image editing by text prompt. Given an original image and a target text prompt, the goal is to generate an edited image that is similar to the original image but conforms to the text prompt. The method uses a pre-trained text-to-image diffusion model and achieves editing through the technique of prompting adjustment inversion. It can realize flexible and diverse image editing without losing the details and quality of the original image.

Gu et al. (2024) propose a novel approach to an immersive image editing experience through personalized subject swapping. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image in a training-free manner using a pre-trained diffusion model. The effectiveness and controllability of Photoswap in personalized subject replacement show its wide application potential in entertainment and professional editing.

Han et al. (2024) propose a new editing method called Proximal Negative-Prompt Inversion (ProxNPI), which extends the concepts of negative-prompt inversion (NPI) (Miyake et al. 2023) and null-text inversion (NTI) (Mokady et al. 2023). ProxNPI reduces artifacts by introducing a regularization term and reconstruction guidance while preserving the tuning-free property.

Fig. 6 Super-resolution results on real-world sample images (Wang et al. 2024)

4.4 Super-resolution

As shown in Fig. 6, the super-resolution task aims to enhance a low-resolution image or video to high resolution through algorithms and models, preserving and recovering as much detail as possible while reducing blurring and distortion.

Saharia et al. (2022) introduce a method called SR3 for image super-resolution. This method aims to achieve efficient and realistic image super-resolution processing by combining a denoising diffusion probability model and a U-Net model. The innovation of SR3 lies in the use of the diffusion probability model and the implementation of the reverse process using the U-Net architecture, effectively avoiding the complex intrinsic optimization problems of traditional autoregressive models.

Li et al. (2022) investigate a single image super-resolution method based on diffusion models - SRDiff. This method adopts the U-Net structure and achieves a stable and reliable training process by introducing multi-scale skip connections and diffusion models, thus generating high-quality and diverse super-resolution images. SRDiff performs excellently on various datasets, avoiding the excessive smoothness and mode collapse problems of traditional methods, while also supporting flexible image operations.

Zhao et al. (2023) introduce a novel image super-resolution method based on diffusion probability models - Partial Diffusion Models (PartDiff). This method gradually diffuses the low-resolution input image into an intermediate latent state of the high-resolution image and performs partial denoising operations, to achieve high-quality image super-resolution. PartDiff achieves good performance on both magnetic resonance imaging and natural images.

The Diffusion Rectification and Estimation-Adaptive Models (DREAM) framework, proposed by Zhou et al. (2023), aims to address the problem of training and sampling inconsistency in conditional diffusion models for super-resolution tasks. By introducing two key components, diffusion rectification, and estimation adaptation, the DREAM framework effectively improves the quality of generated images while accelerating the training convergence speed and sampling efficiency.

Wang et al. (2024) propose SinSR, a simple and effective single-step SR generation method, to overcome the large number of inference steps required by recent methods to improve inference speed. By deriving a deterministic sampling process, the method distils the mapping between random noise and the generated high-resolution image into a student model, and shows that this deterministic mapping allows SR to be performed with only one inference step.

Gandikota et al. (2024) introduce a zero-shot text-guided solution to open-domain image super-resolution. This approach allows users to explore diverse, semantically accurate reconstructions that remain consistent with the low-resolution input under various large downsampling factors, without explicit training for these specific degradations.

To eliminate artifacts arising in the iterative process of traditional diffusion-based SR techniques, Qingping et al. (2024) propose a training-free method, Self-Adaptive Reality-Guided Diffusion (SARGD), which can effectively identify and mitigate the propagation of artifacts in latent space. The method first uses an artifact detector to identify untrustworthy pixels and create a binary mask highlighting artifacts, and then uses Reality Guidance Refinement (RGR) to integrate this mask with a realistic latent representation, refining the artifacts and improving alignment with the original image.

Fig. 7 Several examples of text-to-image generation (Liang et al. 2024)

4.5 Text-to-image generation

As shown in Fig. 7, the text-to-image generation task automatically converts an input text description into a corresponding visual image, aiming to achieve seamless conversion and fusion between language and image (Müller et al. 2013; Yarom et al. 2023; Zhang et al. 2023). With the development of deep learning (LeCun et al. 2015; Bengio et al. 2021), especially diffusion models, the quality of generated images has greatly improved, and text-to-image generation has become one of the most attractive applications in computer vision (Reed et al. 2016; et al. 2018; Yu et al. 2022; Nichol et al. 2022; Jiayi et al. 2024).

Nichol et al. (2022) explore the application of guided diffusion to text-conditional image synthesis and compare two guidance strategies: CLIP guidance (Radford et al. 2021) and classifier-free guidance (Jonathan and Salimans 2022). This method is an early attempt to apply diffusion models to text-to-image generation; it simply replaces the class labels in the class-conditional diffusion model (i.e., ADM (Dhariwal et al. 2021)) with text, so that sample generation is conditioned on the text. Compared to CLIP guidance, classifier-free guidance is preferred by human evaluators in terms of realism and text similarity.

Saharia et al. (2022) proposed Imagen, which abandons GLIDE's cumbersome reliance on pre-trained vision-language models and directly uses large language models such as T5 (Raffel et al. 2020) as text encoders, combined with diffusion models, to complete the direct mapping from text to images. More importantly, they find that a generic large language model is a very effective text encoder for text-to-image generation, and that increasing the size of the text encoder improves sample quality and text-image alignment more effectively than increasing the size of the image diffusion model. In addition, the latest Imagen 3 (Jason et al. 2024), based on the latent diffusion model (Rombach et al. 2022), not only greatly improves text-image alignment to produce high-quality images but also discusses safety and representation issues.

Ramesh et al. (2022) propose a two-stage model in which, given a text, a prior model first generates a CLIP-like image embedding, and a decoder then generates the image conditioned on that embedding. Both the prior and the decoder use diffusion models. This work develops a method for training the diffusion prior in latent space and shows that it performs comparably to an autoregressive prior while being more computationally efficient. [13] propose DALL-E 3 to address the difficulty text-to-image models have in following detailed image descriptions. By training on highly descriptive generated image captions, this method greatly improves the prompt-following ability of text-to-image models.

In addition, diffusion-based text-to-image technologies, such as Stable Diffusion and Midjourney [1], show great potential for commercial applications. These models can transform simple text descriptions into high-quality images, greatly accelerating the content creation process.

To address the high computational cost of modern text-to-image models when generating high-quality images, Jayasumana et al. propose a lightweight method (Structured prediction for efficient text-to-image generation 2024) that uses a Markov Random Field (MRF) to optimize compatibility between image regions, reduce computational cost, and improve image quality. Built on the Muse latent-token text-to-image model, the resulting MarkovGen model combined with the MRF speeds up generation and reduces artifacts, improving image quality.

Because current diffusion models used for image generation struggle to recognise abstract continuous attributes, Cheng et al. (2024) propose Continuous 3D Words, a technique that gives users of text-to-image models fine-grained control over multiple attributes in an image and enables efficient, low-burden adjustment of the generated images.

Inspired by the success of reinforcement learning from human feedback (RLHF) in large language models, Liang et al. (2024) propose several ways to enrich human feedback and train a multimodal transformer to automatically predict this feedback, thereby improving image generation.

4.6 Other tasks

A considerable number of notable works have employed diffusion models to address a range of subtasks associated with image generation. This paper will examine these applications and contributions in detail.

Low-light Image Enhancement Shang et al. (2024) propose a multi-domain multi-scale (MDMS) diffusion model for low-light image enhancement that addresses limitations of standard diffusion models and improves the quality of the generated images. Yi et al. (2023) proposed a physically interpretable diffusion model for low-light image enhancement, designing a multi-path generative diffusion network to handle various degradations in the generation process, including noise, color bias, and dark illumination.

Image Denoising Zeng et al. (2024) introduce diffusion models to hyperspectral image denoising and propose Diff-Unmix, which uses a denoising diffusion model for self-supervised denoising, alleviating the dataset dependence that limits current supervised denoising methods.

Camouflaged Vision Perception (Fan et al. 2022, 2023) To address the high cost of current camouflage image generation methods, which require humans to specify the background, Zhao et al. (2024) propose Latent Background Knowledge Retrieval-Augmented Diffusion (LAKE-RED) to generate camouflage images and expand the diversity of camouflage samples at low cost.

Medical Generative Modeling Chenlu et al. (2024) unify multiple medical generation tasks within a single model, proposing MedM2G, a multimodal medical generative framework that aligns, extracts, and generates medical modalities in one unified model, greatly enhancing the comprehensive diagnostic ability of multimodal medical models.

Monocular Depth Estimation Ke et al. (2024) explore how the rich priors captured by diffusion models can benefit depth estimation, proposing an affine-invariant monocular depth estimation method built on Stable Diffusion that retains its prior knowledge and improves the model's ability to handle new scenes. Patni et al. (2024) explored using the global prior of images produced by a pre-trained ViT-based diffusion model to provide richer contextual information.

Representation Learning Hudson et al. (2024) propose a self-supervised diffusion model for representation learning. The method uses the diffusion model as a powerful representation learner to achieve semantic learning in an unsupervised manner, demonstrating the potential of diffusion models for learning rich representations.

Texture Synthesis Yeh et al. (2024) introduce a new image-guided texture synthesis method based on the diffusion model, which overcomes the limitations of traditional methods by using densely sampled views and precisely aligned geometric images, and greatly improves the quality of texture generation.

Causal Attribution Asnani et al. (2024) propose a causal attribution technique, ProMark, that allows images generated by a generative model to be attributed to elements of the model's training data, such as images, objects, and artists. This approach could transform creative workflows by enabling creators who contribute content for model training to earn rewards.

Image Classification Wang et al. (2024) propose a new inter-class data augmentation method, Diff-Mix, which generates images with diverse foreground objects and backgrounds for specific concepts, thereby improving image classification performance.

Image Morphing (Wolberg 1998) Zhang et al. (2024) propose DiffMorpher, a method for smooth image interpolation that exploits the prior knowledge of pre-trained diffusion models. It fits two LoRAs (Hu et al. 2022) to the two input images to capture their semantics, and then interpolates between the LoRA parameters and the latent noise to achieve a smooth semantic transition, addressing the highly unstructured latent space of current diffusion models.
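The interpolation idea can be sketched as follows, assuming each input image has already been inverted to a latent noise tensor and fitted with a LoRA state dict. This is a simplified illustration of the general recipe, not DiffMorpher's full pipeline.

```python
import torch

def slerp(z0, z1, alpha, eps=1e-7):
    """Spherical interpolation between two latent noise tensors.
    Assumes the two latents are not exactly (anti)parallel."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_omega = torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm() + eps)
    omega = torch.acos(torch.clamp(cos_omega, -1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / so) * z0 + (torch.sin(alpha * omega) / so) * z1

def interpolate_lora(lora_a, lora_b, alpha):
    """Linearly interpolate two LoRA state dicts with matching keys,
    yielding intermediate adapter weights for the in-between frames."""
    return {k: (1 - alpha) * lora_a[k] + alpha * lora_b[k] for k in lora_a}
```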

Image Inpainting Liu et al. (2024) study the semantic discrepancy between masked and unmasked regions in diffusion-based image inpainting and propose StrDiffusion, an inpainting diffusion model that performs texture denoising under structural guidance.

Scene Completion Nunes et al. (2024) explore the application of diffusion models to 3D point cloud generation; instead of directly reusing image-based diffusion methods as in previous work (Lee et al. 2023; Luo and Wei 2021; Lyu et al. 2022), they operate directly on points, redesigning the forward and reverse processes of the diffusion model so that it works effectively on 3D point cloud scenes.

Intrinsic Image Decomposition The ambiguity between illumination and material properties, together with the lack of real-world datasets, makes appearance decomposition challenging. Kocsis et al. (2024) use the powerful priors of recent diffusion models to sample from the solution space of a conditional generative model, greatly improving the model's generalization to real images.

Image Deblurring To construct efficient training datasets from realistic generated blurred images, et al. (2024) propose reBLurring AUgmentation (ID-Blau), which pairs sharp images with controllable blur conditions to generate corresponding blurred images, enabling diverse blurred-image generation.

From the above discussion, it is clear that diffusion models are used throughout the field of image generation, and the applications covered here are only a subset. Diffusion models have also been applied to Image Rectangling (Zhou et al. 2024), Image Segmentation (Baranchuk et al. 2022; et al. 2023), Semantic Matching (Li et al. 2024), Visual Emotion Analysis (Yang et al. 2024), Face Recognition (Boutros et al. 2023), Anomaly Detection (Zhang et al. 2023), etc.

Table 2 Summary of the ethical and social implications of image generation based on diffusion models and corresponding countermeasures

5 Ethical and social implications

As the technology progresses, artistic innovation must be balanced with ethical responsibility. For instance, diffusion models used for image generation should respect copyright and not encroach upon the intellectual property rights of others. At the same time, the influence of the technology on societal culture and values must be considered in order to achieve sustainable and responsible technological advancement. Images generated by diffusion models can also disseminate misleading information and thereby influence public perception and decision-making. It is therefore incumbent upon developers and users to accompany generated images with clear labels and instructions so as to avoid misleading viewers.

5.1 User privacy data leakage

Although diffusion model-based image generation technology has shown great potential in the fields of creativity, entertainment, and design, the risk of user privacy leakage cannot be ignored. If the technology is used improperly, collecting and analyzing image data uploaded by users may unknowingly leak sensitive information such as personal identity and living habits, posing a serious threat to individual privacy (Hu et al. 2023; Zhu et al. 2023; Lin et al. 2023; Carlini et al. 2023; Ni et al. 2023). In addition, privacy leakage may lead to public distrust of new technologies, hinder the healthy development of technological innovation and application, and damage the reputation and prospects of the entire industry. Strengthening data protection and ensuring users' privacy are therefore key issues that must be taken seriously in the promotion and application of such technologies.

Carlini et al. (2023) study the extent to which image diffusion models (such as Stable Diffusion) memorize their training data and the privacy problems this causes. They propose a new data extraction method that successfully extracts a large number of training examples from diffusion models and show how these memorized samples can be identified by generating candidates and filtering them. They also analyze how different model types and parameter settings affect the degree of privacy leakage, finding that diffusion models carry a higher risk of privacy leakage than other generative models. Zhu et al. (2023) investigate privacy leakage of diffusion models in image generation, propose a reconstruction-based membership inference attack, and validate it experimentally on several pre-trained diffusion models. The method performs membership inference by reconstructing images and computing reconstruction errors; compared with traditional gradient-based attacks, it is more efficient, more stable, and harder to defend against. The experiments show that the attack can effectively infer whether examples belong to the training dataset, revealing the privacy risks of diffusion models.
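The general shape of a reconstruction-based membership signal can be sketched as follows: noise a candidate image, let the diffusion model reconstruct it, and treat a low reconstruction error as evidence of membership. The interfaces (`scheduler_add_noise`, `model.reconstruct`) are assumed placeholders, not the attack's published implementation.

```python
import torch

@torch.no_grad()
def reconstruction_membership_score(model, scheduler_add_noise, x0, t, text_emb):
    """Hypothetical sketch of a reconstruction-based membership inference signal.
    A lower score (reconstruction error) is interpreted as the candidate image
    being more likely to have appeared in the training set."""
    noise = torch.randn_like(x0)
    x_t = scheduler_add_noise(x0, noise, t)             # forward-diffuse the candidate image
    x0_hat = model.reconstruct(x_t, t, cond=text_emb)   # placeholder reconstruction call
    return torch.mean((x0 - x0_hat) ** 2).item()        # smaller => more likely a training member
```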

Many researchers have also studied these problems in depth and proposed solutions. Hu et al. (2023) explore the effectiveness of differential privacy as a potential defense and point out that future research should focus on balancing privacy protection against model quality. Lin et al. (2023) explore generating privacy-preserving synthetic data through the APIs of generative models such as Stable Diffusion. They propose a framework called Private Evolution (PE), which uses evolutionary algorithms together with models exposed through existing APIs to generate synthetic data whose distribution resembles the private data, with an iterative process that ensures differential privacy. Ni et al. (2023) propose a new method called Degeneration-Tuning (DT): the core idea is to create a degraded dataset by disrupting low-frequency visual content and to retune the Stable Diffusion model so that unwanted concepts are masked when generating images, thus protecting certain concepts from attacks or leaks.

5.2 Copyright issues

The rapid development of image generation technology based on diffusion models has greatly enriched the opportunities for creative expression and visual content production, but it also raises serious copyright issues (Wang et al. 2024; Dubinski et al. 2024). These technologies can imitate or even create original works with high fidelity, making copyright ownership ambiguous and affecting the protection of rights for original creators (Gu et al. 2024; Zhang et al. 2023). Unauthorized use or reproduction of other people’s work styles for creation may infringe the copyrights of original creators, dampen their enthusiasm for creativity and hinder the healthy development of cultural industries. At the same time, it also increases the difficulty of copyright supervision and rights protection, and poses new challenges to the legal system.

In this regard, solutions include not only improving laws and regulations, strengthening industry self-discipline, raising public awareness of copyright, and exploring new copyright protection models, but also a large number of technical approaches. For the copyright challenges of diffusion models, Gandikota et al. (2024) propose Unified Concept Editing (UCE), which precisely edits models through closed-form solutions to eliminate copyrighted content, correct bias, and control inappropriate concepts. Kumari et al. (2023) propose a concept ablation method for diffusion models, with noise-based and anchor-based variants, which avoids generating a target concept by adjusting the KL divergence between the model's distribution for the target concept and that of an anchor concept. The method effectively removes the target concept while retaining related concepts, demonstrating its usefulness for copyright protection and privacy enhancement. Zhang et al. (2023) address one facet of the infringement problem, namely the generation of infringing content from queries that are not directly related to the copyrighted subject matter. The authors develop a data generation pipeline for building copyright-investigation datasets for diffusion models and use it to generate datasets containing infringement examples for different diffusion models. Casper et al. (2023) proposed and implemented a simple, quantitative method that combines CLIP encoders with standard techniques to measure how well a model imitates a particular artist. Their experiments show that Stable Diffusion can successfully mimic the style of most professional digital artists, a finding with important implications for the relationship between AI-generated images and copyright law. The paper also discusses how image classification techniques can be used to analyze legal claims and test defense strategies against AI imitation of copyrighted works.
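A hedged sketch of the general concept-ablation idea is shown below: fine-tune the model so that its noise prediction for the target concept matches a frozen copy's prediction for an anchor concept (for example, steering "in the style of artist X" toward plain "painting"). Interfaces and names are illustrative assumptions, not the published training code.

```python
import torch
import torch.nn.functional as F

def concept_ablation_loss(student_unet, frozen_unet, x_t, t, target_emb, anchor_emb):
    """Sketch of a concept-ablation style objective. `student_unet` is the model
    being fine-tuned; `frozen_unet` is an untouched copy providing the anchor
    behaviour. All interfaces are illustrative placeholders."""
    with torch.no_grad():
        eps_anchor = frozen_unet(x_t, t, cond=anchor_emb)   # what the model "should" predict
    eps_target = student_unet(x_t, t, cond=target_emb)      # prediction under the target prompt
    return F.mse_loss(eps_target, eps_anchor)
```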

In addition, for commercial applications of diffusion models, researchers have proposed evaluation frameworks and attack methods to assess the copyright security of models. Wang et al. (2024) proposed a data-poisoning attack called SilentBadDiffusion. The method uses multimodal large language models and text-guided image inpainting to generate images tied to specific prompts, and then injects this "poisoned" data into the training process of the diffusion model. The experiments show that only a small amount of poisoned data is needed for the fine-tuned diffusion model to generate infringing content under specific trigger prompts. Dubinski et al. (2024) propose a new evaluation framework and attack method to assess the effectiveness of membership inference attacks. By designing a new dataset, LAION-mi, the authors find that previous evaluation schemes fail to fully reflect the true impact of membership inference attacks, revealing the serious privacy and copyright issues that large diffusion models face when processing copyrighted images. The paper also highlights the challenges of shadow-model attacks, such as high computational cost and difficult sampling.

5.3 Bias and fairness problem

The rapid development of diffusion models in image generation has greatly enriched the possibilities of content creation and visual presentation, but it has also brought problems of bias and fairness that cannot be ignored (Luccioni et al. 2023; Naik and Nushi 2023; Jiang et al. 2023; De Simone et al. 2023). These issues can not only exacerbate existing social inequalities, such as the automatic reproduction and propagation of gender, racial, or cultural stereotypes, but also limit the diversity of innovation and affect the inclusiveness and ethical acceptability of the technology. In the long run, without effective governance, they will hinder the healthy development of the technology, damage public trust, and threaten the cultural diversity and inclusiveness of society.

Bansal et al. (2022) study the effect of moral natural language interventions on text-to-image generation models, specifically the performance of the stable diffusion model. Using the ENTIGEN dataset, the authors evaluated the impact of ethical interventions on image-generating diversity across three social axes: gender, skin color, and culture.

Naik and Nushi (2023) aim to systematically investigate and quantify the social biases in text-to-image generation models. Using the stable diffusion model as an example, this paper examines its performance in terms of gender, race, age, and geographic location. By designing a series of experiments, including the use of different prompt words and automated and human scoring methods, the study found that the Stable Diffusion model has a significant bias in image generation.

Luccioni et al. (2023) aim to explore the problem of social bias in machine learning driven text-to-image (TTI) systems, specifically for the performance of stable diffusion models. By proposing an evaluation method based on social attributes, combined with an analysis of occupational and social attributes, this paper reveals the gender and racial biases that exist in TTI systems when generating images.

Friedrich et al. (2023) discuss the bias problem of artificial intelligence in text-to-image generation and propose a new strategy called Fair Diffusion, designed to reduce or eliminate bias by controlling the direction and proportion of the model's output. The research focuses on the Stable Diffusion model and reveals its gender bias in image generation through a series of experiments.

Schramowski et al. (2023) focus on the problems of bias and misbehavior that arise when text-conditional image generation models, in particular Stable Diffusion, are deployed in real-world applications. By introducing a Safe Latent Diffusion (SLD) approach, the paper aims to suppress or remove inappropriate content from the generated images.

Shen et al. (2023) focus on solving the fairness problem of text-to-image diffusion models, especially the bias of stable diffusion models. By introducing the distribution alignment loss function and fine-tuning the sampling process, this paper aims to control the distribution of certain attributes of the generated image to achieve fairness and diversity.

Jiang et al. (2023) aim to correct for racial stereotyping in image generation models such as stable diffusion. By introducing a framework called RS-corrector, this method adjusts the hidden code in latent space to eliminate racial bias while maintaining the integrity of the original model.

Li et al. (2023) focus on the fairness problem in text-to-image diffusion models, especially the bias that can arise when generating human-related descriptions. To this end, the authors propose Fair Mapping, a lightweight, general-purpose solution that does not depend on a specific model. Using carefully designed prompts to control sensitive attributes and adjusting offsets in the embedding space to correct the semantic features of the original prompt, Fair Mapping aims to achieve fairer image generation.

De Simone et al. (2023) aim to address bias in text-to-image (TTI) generation models by designing and implementing a tool, the fair diffusion model, to improve the fairness and transparency of such models. Through an interactive interface and editing options, the tool allows users to analyze and adjust the worldview of the model to ensure that the generated images meet the fairness standards users expect.

Gandikota et al. (2024) mainly study various security problems in text-to-image models and propose a solution called “Unified concept editing (UCE)”. The method uses a closed-form solution to accurately edit the model and supports simultaneous processing of multiple concepts such as bias, copyright, and offensive content without retraining the model. UCE enables targeted bias correction for multiple attributes while removing potentially copyrighted content and controlling inappropriate concepts.

5.4 Inappropriate image generation

The problem of inappropriate image generation faced by image generation technology based on diffusion models cannot be ignored. This problem may not only lead to the generation of violent, pornographic, or discriminatory images, which seriously violate social ethics, laws, and regulations, but also mislead the public and affect the healthy dissemination of information (Seunghoo and Juhun 2024; Qu et al. 2023; Brack et al. 2023, 2023; Rando et al. 2022). In addition, it can damage personal reputation and privacy, and aggravate the sense of insecurity and distrust in cyberspace. Therefore, how to effectively identify and prevent the generation of inappropriate images has become a key challenge to be solved in this field.

Rando et al. (2022) study the security of the safety filter in the Stable Diffusion model. By simulating attacker behavior and performing reverse-engineering analysis, they reveal vulnerabilities in the model's safety filter and propose corresponding improvements. The researchers generated several types of images to test the filter and successfully bypassed it to generate pornographic content. They conclude that the current Stable Diffusion safety filter has loopholes and that its security should be strengthened through open documentation and disclosure channels.

Chin et al. (2023) propose an automated tool, Prompting4Debugging (P4D), that detects safety vulnerabilities which can lead to inappropriate image generation by optimizing prompts. P4D uses prompt engineering to find modified prompts that bypass safety mechanisms, through continuous/soft embedding optimization and discrete/hard embedding projection. The experiments show that even seemingly safe prompts can be manipulated, underscoring the importance of thoroughly testing the security of T2I diffusion models.
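The soft-embedding optimization step can be sketched as below: a continuous prompt embedding is adjusted so that the safety-tuned model's noise prediction approaches what an unconstrained model produces for the original harmful prompt. This is a heavily simplified illustration of the general red-teaming recipe; all names and interfaces are assumptions, not P4D's released code.

```python
import torch
import torch.nn.functional as F

def soft_prompt_attack_step(safe_unet, unconstrained_unet, soft_emb, harmful_emb,
                            x_t, t, lr=1e-2):
    """One illustrative optimization step on a continuous ("soft") prompt embedding.
    All models and embeddings are hypothetical placeholders."""
    soft_emb = soft_emb.clone().requires_grad_(True)
    with torch.no_grad():
        eps_target = unconstrained_unet(x_t, t, cond=harmful_emb)  # behaviour to reproduce
    eps_safe = safe_unet(x_t, t, cond=soft_emb)                    # behaviour under the optimized prompt
    loss = F.mse_loss(eps_safe, eps_target)
    loss.backward()
    return (soft_emb - lr * soft_emb.grad).detach()                # gradient step on the soft prompt
```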

Qu et al. (2023) comprehensively study the risks of text-to-image models generating unsafe images and hateful messages. By constructing a taxonomy of five kinds of unsafe images and probing four advanced text-to-image models, they find that these models risk generating unsafe images, with Stable Diffusion being particularly prone to doing so.

Brack et al. (2023) conduct an in-depth study of the security of image generation models under text conditions in the application. The goal of the study was to uncover systemic security issues in existing imaging models and to assess the impact of counterattacks.

Pham et al. (2023) conduct an in-depth study of concept erasure methods in text-to-image generation models. Seven different concept erasure methods are described in detail and their effects on pre-trained diffusion models are shown. The experimental results show that these methods do not truly eliminate sensitive concepts: the "erased" concepts can be reintroduced by adjusting the embeddings of the input words.

Zhang et al. (2023) study the problem of text inversion for concept censorship in the text-to-image generation model and proposed a solution based on backdoor technology. By selecting sensitive words as triggers during training and using these triggers in combination with personalized embedding in the generation phase, the model outputs predefined target images instead of images containing malicious concepts. Experimental results show that this method can effectively prevent the cooperation between text inversion technology and censored words, while maintaining the original function of the model.

Brack et al. (2023) focus on the problem of text-to-image models generating inappropriate content and propose two solutions: negative prompting and semantic guidance (SEGA). The study aims to make generated images consistent with human preferences through evaluation and guidance strategies. In practice, negative prompting reduces inappropriate content by steering generation away from specified undesired prompts, while SEGA manipulates the generation process by adding extra guidance signals while minimizing changes to the original image. Both guidance methods effectively reduce the probability of generating inappropriate content, with SEGA performing better.
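Negative prompting reuses the classifier-free guidance formula sketched in Sect. 4's text-to-image discussion, but replaces the unconditional branch with an embedding of the undesired content so that sampling is pushed away from it. The snippet below is an illustrative sketch with placeholder interfaces.

```python
def negative_prompt_prediction(model, x_t, t, prompt_emb, negative_emb, scale=7.5):
    """Sketch of negative prompting: guide away from an embedding of undesired
    content (e.g. "violence, nudity") instead of the null prompt.
    `model` and the embeddings are illustrative placeholders."""
    eps_negative = model(x_t, t, cond=negative_emb)   # prediction for the undesired content
    eps_prompt = model(x_t, t, cond=prompt_emb)       # prediction for the user's prompt
    return eps_negative + scale * (eps_prompt - eps_negative)
```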

Zhang et al. (2023) focus on the privacy, copyright, and safety problems of text-to-image generation models, in particular their ability to learn and generate unauthorized personal information, unauthorized content, and potentially harmful content. To this end, the authors propose an efficient and cost-effective solution called "selflessness", which removes a particular identity, object, or style from the model without affecting its ability to generate other content.

Heng et al. (2024) propose Selective Amnesia (SA) to address selective forgetting in deep generative models. Drawing on Bayesian continual learning (BCL), the method integrates Elastic Weight Consolidation (EWC) and Generative Replay (GR) into a single training loss, allowing specific concepts to be forgotten without access to the original training dataset. This provides a new solution to the problem that large text-to-image models can generate harmful, misleading, and inappropriate content.
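The structure of such an objective can be sketched as a forgetting term plus a generative-replay term, regularised by an EWC penalty that anchors parameters important for retained knowledge. The sketch below is a loose illustration under those assumptions; the variable names and weighting are placeholders, not the paper's exact formulation.

```python
import torch

def selective_forgetting_loss(forget_loss, replay_loss, params, params_star, fisher, lam=1.0):
    """Illustrative combination of a forgetting term, a generative-replay term,
    and an EWC penalty. `fisher` holds per-parameter Fisher-information estimates,
    `params_star` the reference (pre-forgetting) parameters; all are assumptions."""
    ewc_penalty = sum(
        (f * (p - p_star) ** 2).sum()
        for f, p, p_star in zip(fisher, params, params_star)
    )
    return forget_loss + replay_loss + lam * ewc_penalty
```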

Seunghoo and Juhun (2024) propose a new algorithm called “Concept Eraser”. This method achieves the goal of removing or replacing specific concepts in the pre-trained model by modifying the drift of the classifier guide term and the unconditional score term. This algorithm can not only effectively erase the object concept, but also maintain the generative ability of the model.

5.5 Fake images

The proliferation of fake images, with their deceptive nature, has the potential to mislead the public, distort the truth, and contribute to misunderstanding and panic. They can erode trust, disrupt social stability, and harm individual reputation and mental health (Shen et al. 2019; Nash et al. 2009). Researchers have developed a number of detection and identification methods to address this threat. Among these, methods combining generative adversarial networks (GANs) and convolutional neural networks (CNNs) have shown particularly promising results (Neves et al. 2020; Raza et al. 2024; Bhandari et al. 2023).

The advent of diffusion models in image generation has introduced new challenges to the authenticity and integrity of digital images. Qiang et al. (2023) conducted a comprehensive investigation into the collection mechanism of images generated by diffusion models and developed a hybrid neural network model that integrates attention-guided feature extraction (AGFE) and vision transformers (ViTs) based feature extraction (ViTFE) modules to enhance the representation of fake trace features.

Raza et al. (2024) have proposed an innovative framework, Multi-Model GAN Guard (MMGANGuard), which employs transfer learning and multi-model fusion techniques for the automated identification of GAN-generated fake images, thereby enhancing the precision and scalability of the detection process.

Tassone et al. (2024) have proposed an in-depth analysis of the application of two continual learning techniques in addressing the generalization challenges faced by deepfake detection technology. This research involves a comprehensive examination of continual learning techniques for both short and long sequence fake media.

The efficacy of existing AI-generated image detection methods is contingent upon the availability of extensive training data, which is often difficult to obtain. et al. (2023) developed FAMSeC, a novel AI-generated image detection method that aims to train a general-purpose detector from a limited number of training samples while avoiding overfitting and preserving the generalization capabilities of the pre-trained model.

Chen et al. (2024) explored the intricacies of differentiating genuine images from those generated by the Stable Diffusion Model (SDM). They devised a novel approach, comprising a convolutional neural network (CNN) and a Transformer-based detection model, which effectively identifies artificially created images from SDMs. Furthermore, they were the first to assess the generalisation capacity of these detection models across diverse scenarios.

Cazenavette et al. (2024) put forth a technique, designated "FakeInversion", which employs features derived from open-source pre-trained Stable Diffusion models to identify synthetic images. A salient attribute of this technique is its capacity to generalise effectively to unseen, high-visual-fidelity generators, even when trained exclusively on low-fidelity images generated by Stable Diffusion.

6 Challenges and future directions

Dataset limitation The accelerated advancement of image generation techniques based on diffusion models can be attributed to the accessibility of extensive, high-quality datasets. For instance, the current text-to-image synthesis employs billions of high-quality (text, image) pairs (Ramesh et al. 2022; Nichol et al. 2022). However, some other subtasks continue to grapple with data scarcity. Furthermore, datasets also confront challenges related to data bias, encompassing aspects such as language, ethnicity, and gender. These issues can give rise to substantial biases and fairness concerns.

High computational cost The principal challenges to diffusion models include the high cost of training and the large number of inference steps, which exacerbate the disparity in access to resources between industry and academia (Blattmann et al. 2023; Ganguli et al. 2022). Despite efforts to reduce training costs, dataset size and time complexity remain significant obstacles. Furthermore, the models still struggle to generate readable text within images, and their computational requirements limit deployment in real-world applications. Research should therefore focus on improving model efficiency, reducing computational cost, and exploring further improvements in wavelet-based methods. Additionally, the success of deep learning relies on large amounts of labelled data, which poses a challenge for small companies and edge devices.

Image evaluation The current methods for evaluating image generation are limited in their ability to comprehensively assess quality, rely on user testing and subjective scoring, and are susceptible to bias (Saharia et al. 2022; Parmar et al. 2022; Radford et al. 2021). To address these shortcomings, it is essential to develop more targeted evaluation benchmarks and indicators, as well as more reliable and diverse automatic evaluation criteria.
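As a concrete illustration of one widely used automatic criterion, text-image alignment is often approximated with a CLIP-based similarity score. The sketch below assumes the Hugging Face transformers CLIP implementation and is meant only as an example of an automatic metric, not a complete evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_alignment_score(images, captions, model_name="openai/clip-vit-base-patch32"):
    """Cosine similarity between CLIP image and text embeddings, computed per
    (image, caption) pair. `images` is a list of PIL images, `captions` a list
    of strings of the same length."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # higher = better text-image alignment
```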

Multimodal framework The generation of content from text to image, part of what is now called Artificial Intelligence Generated Content (AIGC), has attracted considerable interest from both academia and industry. Current popular large language models (OpenAI 2023), based on autoregressive modelling, have achieved considerable success, particularly in their capacity to generalise across domains and zero-shot tasks. Meanwhile, in image generation, diffusion models are widely adopted, as represented by Stable Diffusion (Esser et al. 2024) and Sora (OpenAI 2023). As a crucial step towards general artificial intelligence, current research aims to integrate multiple tasks into a single model, thereby constructing multimodal models. Consequently, research interest has shifted towards analysing the emergent capabilities of diffusion models and developing versatile models capable of generating diverse outputs and handling various data types. This is considered a major challenge and research direction in image generation.

Data Security and Social Ethics The advent of models such as Stable Diffusion and Midjourney has precipitated rapid development in the field of image generation, markedly increasing the stylistic diversity of image creation. However, this technological advancement has also been accompanied by data privacy violations, copyright disputes, and the potential for misinformation and disinformation. These problems not only threaten the rights and interests of individuals, but also challenge social trust and moral standards. Future research should therefore focus on strengthening data ethics and privacy protection, developing more transparent and explainable models, and improving the controllability of generated content.

7 Conclusion

We have provided a multi-perspective view of the development and impact of diffusion models in the field of image generation. We first introduced the background of diffusion models through three foundational formulations, DDPM, SGMs, and SDEs, and explained several improvements to diffusion models for image generation. Second, we explored the wide application and strong performance of diffusion models in various subfields of computer vision, including style transfer, image completion, image processing, super-resolution, and 3D image generation. Finally, we conducted a comprehensive analysis of the potential social and ethical implications and challenges of diffusion model-based image generation techniques. In summary, this paper provides an in-depth analysis and discussion of the application and potential social impact of diffusion models in the field of image generation. We hope that this survey can provide guidance and inspiration for the future development of diffusion models in this field.