1 Introduction

Quantum Machine Learning (QML) has recently emerged as a promising framework for generative Artificial Intelligence (AI). In fact, over the years, QML algorithms have been developed for both supervised [1,2,3] and unsupervised [4, 5] learning tasks. The implementation and performance analysis of these machine learning algorithms have shown that quantum computing can bring numerous advantages. In particular, the benefits come from the exponentially large state space that a quantum system can express, as well as from the ability to represent mappings that are classically impossible to compute [6, 7]. Additionally, an investigation based on the effective dimension, a data-dependent capacity measure, is presented in [8], revealing that quantum models can achieve a higher effective dimension than comparable classical neural networks.

In classical AI, Diffusion Models (DMs) have established themselves as the leading candidates for data and image generation [9,10,11], showcasing superior quality and training stability when compared to state-of-the-art Generative Adversarial Networks (GANs) [12]. DMs rely on an iterative diffusion process that models complex data distributions by progressively refining samples through a sequence of denoising steps. However, DMs may encounter notable challenges, including high computational requirements and the need for extensive parameter tuning [13].

Considering generative QML, various implementations of Quantum Generative Adversarial Networks (QGANs) have demonstrated superior performance in capturing the underlying data distribution. They have also shown better generalization properties, allowing for a significantly lower number of trainable parameters compared to classical GANs [14,15,16,17]. One of the first implementations of a QGAN is described in [14], where only the generator is realized using quantum circuits and two different solutions are offered. The first, called Quantum Patch GAN, divides an image into sub-images, which are generated by sub-generators and later recombined to form the complete image; in the second, called Quantum Batch GAN, there is no division into sub-images.

Another possible implementation of QGANs is proposed in [15], where a novel architecture called style-qGAN is presented. In this scenario, the Generator is entirely quantum whereas the Discriminator remains entirely classical. The novelty of this model consists in embedding the latent variable into every quantum gate of the network, whereas previous qGAN models introduced it only at the beginning of the network. The results obtained by testing the architecture on toy data, namely 1D gamma and 3D correlated Gaussian distributions, as well as on data for realistic quantum processes at the Large Hadron Collider (LHC), demonstrate the effectiveness of such an architecture and its potential use for data augmentation, as it is capable of reproducing known reference distributions from small sample sets.

Concerning DMs, there have been only a few initial and simple attempts to develop a quantum version of a DM. While [18] provides a purely theoretical discussion of a potential quantum generalization of diffusion models, with results tied only to very basic and simplified scenarios, two different architectures are proposed in [19]: the first works on downsized images from the MNIST dataset, while the second operates on a latent space using a pretrained autoencoder. Both architectures use amplitude encoding to load data into quantum circuits, which is a memory-efficient approach suitable for current quantum devices but requires an exponential number of circuit runs to fully reconstruct the entire output distribution of bitstrings.

In this paper, we propose an efficient methodology to design Hybrid Quantum Diffusion Models (HQDMs) by incorporating variational quantum layers, with novel circuit designs, within a classical U-Net [20]. Namely, we employ a state-of-the-art U-Net architecture, proposed in [11] and composed of ResNet [21] and Attention blocks, as the foundation for our approach. The U-Net is then hybridized in two different modes: the first involves inserting Variational Quantum Circuits (VQCs) only at the vertex, while the second implies the insertion of VQCs on both the encoder side, which is more sensitive to feature extraction, and the vertex, which is responsible for compressed image processing. The rationale behind this approach is to leverage the strengths of both classical and quantum computing paradigms. By integrating variational quantum layers into the classical U-Net architecture, we aim to exploit the expressive power of quantum circuits for faster network convergence and superior generalization capabilities [8]. On the other hand, classical U-Net layers allow us to introduce modularity and nonlinearity in the computation, thus enabling complex image processing while mitigating the inherent challenges associated with quantum computing [22, 23]. Furthermore, by strategically placing VQCs at key points within the U-Net architecture, we ensure that quantum layers are allocated to areas where they can have the most significant impact on performance, thanks to their expressivity and feature extraction capabilities [24,25,26].

We also propose to adopt an approach inspired by transfer learning, aiming at an overall efficient training time of the models. During the initial epochs, a classical model is trained and then some of its weights are transferred to a hybrid model, which is trained, in turn, for some further epochs. In this way, while still maintaining a limited training time, we achieve better results compared to the classical counterpart: on Fashion MNIST, we achieve improvements of approximately 2% in FID, 5% in KID, and 2% in IS. On MNIST, the results are even better, with improvements of approximately 8% in FID, 11% in KID, and 2% in IS.

Our innovative architecture seamlessly incorporates quantum elements by employing angle encoding as the primary encoding method, while the outputs of the circuits are retrieved through the expected value of the Pauli-Z observable. Since the VQCs are placed at specific points in the U-Net where the image dimensions are reduced, and thus the number of pixels to be encoded in the circuit through angle encoding is low, we achieve a crucial result: a low number of qubits, which optimizes resource utilization and ensures practicality when deploying our model on Noisy Intermediate-Scale Quantum (NISQ) devices. Moreover, unlike amplitude encoding, angle encoding has the advantage of streamlining output computation without the need for an exponentially long circuit, which is infeasible for current NISQ machines.

Additionally, the analysis of two different hybrid architectures allows us to highlight how certain parts of the U-Net are more sensitive to quantum integration than others. Not only do these enhancements lead to remarkable improvements across various performance metrics, but they also yield a substantial reduction in the number of parameters requiring training. In particular, depending on the degree of U-Net hybridization, we can achieve up to about an 11% reduction in parameters compared to the classical network. Furthermore, it is possible to achieve an improvement of almost 2% on the FID and KID metrics in the case of the Fashion MNIST dataset, while in the case of the MNIST dataset we achieve an improvement of about 5% on the FID and more than 6% on the KID compared to the classical network. The parameter reduction not only streamlines the overall training process, but also contributes to a more efficient utilization of computational resources, ultimately enhancing the model's scalability and applicability in real-world scenarios.

The rest of the paper is organized as follows. In Sect. 2 we provide an explanation of DMs and variational circuits. In Sect. 3 we present the employed methodology, while in Sect. 4 we discuss the obtained results. Finally, we draw our conclusions in Sect. 5.

2 Theoretical Background

2.1 Classical Diffusion Models

In recent years, DMs have proven to be an important class of generative models. A standard mathematical formulation for diffusion models is the one presented in [10], summarized here to give readers a general overview of its fundamentals.

DMs mainly consist of two distinct phases, as shown in Fig. 1. The first one is the forward process, also called diffusion, involving a transformation that gradually corrupts samples from the original data distribution \({\textbf{x}_0\sim q}\), where q is the probability distribution to be learned, by repeatedly adding Gaussian noise:

$$\begin{aligned} \textbf{x}_t = \sqrt{1 -\beta _{t}}\textbf{x}_{t-1} + \sqrt{\beta _{t}}\varvec{\epsilon }_t,\,\,t=1\dots T, \end{aligned}$$
(1)

where \({\varvec{\epsilon }_1,\dots ,\varvec{\epsilon }_T}\) are IID samples drawn from a zero-mean, unit variance Gaussian (normal) distribution \({\mathcal {N}}(\textbf{0}, \textbf{I})\), and \(\beta _{t}\) determines the variance scale for the t-th step. This progression is underpinned by a Markov chain that can be represented as follows:

$$\begin{aligned}&q(\textbf{x}_{0:T}) = q(\textbf{x}_0)\prod _{t=1}^{T}q(\textbf{x}_{t}|\textbf{x}_{t-1})\,,\end{aligned}$$
(2)
$$\begin{aligned}&q(\textbf{x}_{t}|\textbf{x}_{t-1})= {\mathcal {N}}\left( \sqrt{1 -\beta _{t}}\textbf{x}_{t-1}, \beta _{t}\textbf{I}\right) \,, \end{aligned}$$
(3)

where \(\textbf{I}\) is the identity matrix.

The goal of the forward process is to add incremental noise to the initial sample \(\textbf{x}_0\) over a certain number of steps, until at the final time step T all traces of the original distribution \({\textbf{x}_0\sim q}\) are lost, so as to obtain \({\textbf{x}_T\sim {\mathcal {N}}(\textbf{0}, \textbf{I})}\). Through the application of the ‘reparameterization trick’, a closed-form solution becomes available for calculating the noisy sample at any desired step using the cumulative product:

$$\begin{aligned} \textbf{x}_{t} = \sqrt{{{\bar{\alpha }}}_{t}}\,\textbf{x}_{0}+ \sqrt{1 -{{\bar{\alpha }}}_{t}}\,\varvec{\epsilon }, \end{aligned}$$
(4)

where \(\alpha _{t} = 1 - \beta _{t}\), \({{\bar{\alpha }}}_{t}= \prod _{i=1}^{t} \alpha _{i}\), and \({\varvec{\epsilon }\sim {\mathcal {N}}(\textbf{0}, \textbf{I})}\) is the Gaussian noise.
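The closed form of Eq. (4) can be checked against the step-by-step recursion of Eq. (1) with a short numerical experiment. The following is a minimal NumPy sketch, assuming an illustrative linear schedule for \(\beta _t\) and a toy four-pixel sample; it is not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # illustrative linear beta schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative products \bar{alpha}_t

x0 = rng.standard_normal(4)               # toy "image" with four pixels

def q_sample(x0, t):
    """Closed-form forward noising of Eq. (4): draw x_t directly from x_0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Step-by-step noising of Eq. (1): after T steps the sample is close to N(0, I).
x = x0.copy()
for t in range(T):
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(q_sample(x0, T - 1))                # one-shot sample x_T
print(x)                                  # iteratively noised sample x_T
```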

Fig. 1 Diffusion process and reverse process of a DM

The second phase is the reverse process, or backward diffusion, where the transformations gradually remove the noise and reconstruct a noise-free version of the original data. If we could successfully reverse the aforementioned process by sampling from \({q(\textbf{x}_{t-1}|\textbf{x}_t)}\), we would gain the ability to recreate the true sample starting from the Gaussian noise input \({\textbf{x}_T\sim {\mathcal {N}}(\textbf{0}, \textbf{I})}\); it is also noteworthy that when \(\beta _t\) is sufficiently small, \({q(\textbf{x}_{t-1}|\textbf{x}_t)}\) is close to a Gaussian distribution. Unfortunately, estimating \({q(\textbf{x}_{t-1}|\textbf{x}_t)}\) is intractable because it depends on the entire dataset; hence, a data-driven learning model like a neural network must be used to approximate these conditional probabilities, enabling the execution of the reverse diffusion process.

Let \(p_{\varvec{\theta }}\) be the mathematical model depending on some parameters \(\varvec{\theta }\) that represents the estimated distribution of the backward diffusion process:

$$\begin{aligned}&p_{\varvec{\theta }}(\textbf{x}_T) = {\mathcal {N}}(\textbf{0}, \textbf{I})\,, \end{aligned}$$
(5)
$$\begin{aligned}&p_{\varvec{\theta }}(\textbf{x}_{t-1}|\textbf{x}_{t})= {\mathcal {N}}\left( \varvec{\mu }_{\varvec{\theta }}(\textbf{x}_{t},t), \varvec{\Sigma }_{\varvec{\theta }}(\textbf{x}_{t},t)\right) \,, \end{aligned}$$
(6)

where \(\varvec{\mu }_{\varvec{\theta }}(\textbf{x}_{t},t)\) and \(\varvec{\Sigma }_{\varvec{\theta }}(\textbf{x}_{t},t)\) are the general outputs of the adopted neural network, which takes as inputs \(\textbf{x}_{t}\) and t.

A simplified approach based on variational inference assumes a fixed covariance matrix, for instance \({\varvec{\Sigma }_{\varvec{\theta }}(\textbf{x}_{t},t)=\beta _t\textbf{I}}\), and lets the neural network directly estimate the noise \(\varvec{\epsilon }_{\varvec{\theta }}(\textbf{x}_{t},t)\) at time step t. Then, using reparameterization and the normal distributions of conditional data, we obtain:

$$\begin{aligned} \varvec{\mu }_{\varvec{\theta }}(\textbf{x}_{t},t) = {\frac{1}{\sqrt{\alpha _t}}} \left( \textbf{x}_t - \frac{1-\alpha _t}{\sqrt{1-{{\bar{\alpha }}}_t}}\varvec{\epsilon }_{\varvec{\theta }}(\textbf{x}_{t},t)\right) . \end{aligned}$$
(7)
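In code, a single reverse step samples \(\textbf{x}_{t-1}\) from the Gaussian of Eq. (6) with mean given by Eq. (7) and fixed covariance \(\beta _t\textbf{I}\). The sketch below is a minimal NumPy illustration; `eps_model` is a hypothetical stand-in for the trained noise-prediction network.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_sample_step(x_t, t, eps_model, betas, alphas, alpha_bars):
    """One reverse step: x_{t-1} ~ N(mu_theta(x_t, t), beta_t * I), Eqs. (6)-(7)."""
    eps_hat = eps_model(x_t, t)                                   # predicted noise
    mu = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mu                                                 # no noise added at the last step
    return mu + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```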

The neural network producing \(\varvec{\epsilon }_{\varvec{\theta }}(\textbf{x}_{t},t)\) is usually trained by stochastic gradient descent on an even more simplified loss function like:

$$\begin{aligned} L^\textrm{simple}_t = {\mathbb {E}}_{\textbf{x}_0\sim q,t,\varvec{\epsilon }\sim {\mathcal {N}}(\textbf{0},\textbf{I})} \left[ \left\| \varvec{\epsilon }_{\varvec{\theta }}(\textbf{x}_{t},t) - \varvec{\epsilon } \right\| ^2 \right] . \end{aligned}$$
(8)
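Training therefore amounts to repeatedly drawing a sample, a time step, and a noise realization, noising the sample with Eq. (4), and penalizing the squared error between the true and predicted noise as in Eq. (8). A minimal sketch of this objective is given below; the zero-predicting dummy model is only a placeholder for the (hybrid) U-Net described in Sect. 3.

```python
import numpy as np

rng = np.random.default_rng(2)

def simple_loss(eps_model, x0_batch, alpha_bars, T):
    """Monte-Carlo estimate of Eq. (8): MSE between the injected and the predicted noise."""
    t = rng.integers(0, T)                                        # random time step
    eps = rng.standard_normal(x0_batch.shape)                     # target noise
    x_t = np.sqrt(alpha_bars[t]) * x0_batch + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps_model(x_t, t) - eps) ** 2)

# Toy usage with a dummy "network" that always predicts zero noise.
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
x0_batch = rng.standard_normal((8, 4))
print(simple_loss(lambda x, t: np.zeros_like(x), x0_batch, alpha_bars, T))
```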
Fig. 2 Scheme of a hybrid quantum-classical VQC

2.2 Variational Quantum Circuits

Variational Quantum Algorithms (VQAs) are the most common QML algorithms currently implemented on today's quantum computers [27]; they make use of parametrized quantum circuits known as ansatzes. Ansatz circuits are composed of quantum gates that manipulate qubits through specific parametrized unitary operations. These operations depend on parameters, denoted as \(\varvec{\theta }\), which are optimized during training.

The training workflow of a VQC, shown in Fig. 2, can be summarized as follows:

  1. Classical data \(\varvec{x} \in {\mathbb {R}}^{n}\) is appropriately encoded in a quantum state of the Hilbert space \({\mathbb {H}}^{2^{n}}\) through the unitary \(U_{\phi }({\varvec{x}})\), to be used by the quantum computer;

  2. An ansatz \(U_W({\varvec{\theta }})\) of \(\varvec{\theta }\)-parametrized unitaries with randomly initialized parameters and fixed entangling gates is applied to the quantum state \(|\phi ({\varvec{x}})\rangle\) obtained after the encoding;

  3. Upon completion, measurements are taken to obtain the desired outcomes. The expected value with respect to a given observable \({\hat{O}}\) is typically computed, and the resulting prediction is given by:

     $$\begin{aligned} f({\varvec{x}}, \varvec{\theta }) = \langle \phi (\varvec{x}) | U_W(\varvec{\theta })^\dagger {\hat{O}} U_W(\varvec{\theta }) | \phi (\varvec{x}) \rangle , \end{aligned}$$
     (9)

  4. Finally, a suitable loss function is evaluated, and a classical co-processor is used to properly update the parameters \(\varvec{\theta }\).

This cycle is repeated until a termination condition is met. To update \(\varvec{\theta }\) and train the VQC, gradient-based techniques can be used; gradients in a parametrized quantum circuit are calculated via the parameter-shift rule:

$$\begin{aligned} \nabla _{\theta }f(\varvec{x},\theta ) = \frac{1}{2} \left[ f(\varvec{x},\theta + \frac{\pi }{2}) - f(\varvec{x},\theta - \frac{\pi }{2}) \right] , \end{aligned}$$
(10)

where \(f(\varvec{x}, \theta )\) is the output of the quantum circuit and \(\theta\) is the parameter to be optimized.
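The full workflow can be sketched with a few lines of PennyLane, the framework used later in Sect. 4. The circuit below is a minimal, generic example, assuming a simple rotation-plus-CNOT ansatz (not the HQConv/FQConv ansatzes of Sect. 3) and, for brevity, a single Pauli-Z measurement instead of one per qubit.

```python
import pennylane as qml
from pennylane import numpy as pnp

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, diff_method="parameter-shift")      # gradients obtained through Eq. (10)
def vqc(x, theta):
    # 1. Angle encoding U_phi(x): one input feature per qubit via R_x rotations.
    for i in range(n_qubits):
        qml.RX(x[i], wires=i)
    # 2. Ansatz U_W(theta): trainable rotations plus fixed entangling CNOT gates.
    for layer in range(theta.shape[0]):
        for i in range(n_qubits):
            qml.RY(theta[layer, i], wires=i)
        for i in range(n_qubits):
            qml.CNOT(wires=[i, (i + 1) % n_qubits])
    # 3. Expected value of the Pauli-Z observable (one qubit here; our circuits measure all).
    return qml.expval(qml.PauliZ(0))

x = pnp.array([0.1, 0.5, -0.3, 0.7], requires_grad=False)
theta = pnp.array(pnp.random.uniform(0, 2 * pnp.pi, size=(3, n_qubits)), requires_grad=True)

print(vqc(x, theta))                                # prediction f(x, theta) in [-1, 1]
# 4. A classical co-processor would use these parameter-shift gradients to update theta.
print(qml.grad(vqc, argnum=1)(x, theta))
```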

Fig. 3 First proposed hybrid U-Net architecture named QVU-Net, where the quantum part is incorporated at the vertex of the network

3 Proposed Methodology

In the following, we introduce the quantum hybrid architectures proposed in this paper.

3.1 Quantum Vertex U-Net Hybrid Architecture

Starting from the classical U-Net architecture, which is an extremely parameter-rich and expressive model, we propose a hybrid architecture capable of harnessing quantum capabilities to achieve better performance and, furthermore, to reduce the total number of parameters to be trained. The classical U-Net is initially implemented according to the proposal by [28], aiming at working with a state-of-the-art model: it is composed of ResNet blocks for residual connections and Attention blocks for feature aggregation; specifically, we use Multi-head Attention with four attention heads, as suggested in [28]. The ResNet and Attention layers are hence applied at various resolution levels in the U-Net. The first hybrid architecture we propose, which we name Quantum Vertex U-Net (QVU-Net), uses this U-Net as its reference architecture and efficiently integrates quantum layers within its structure, as shown in Fig. 3.

We use angle encoding as the data encoding method in order to achieve an architecture that efficiently incorporates quantum layers. This approach avoids the inefficiencies associated with amplitude encoding, which requires exponentially long circuits and an exponential number of circuit runs to generate statistically valid outputs from the quantum state distribution with the same dimensionality as the inputs. In our angle encoding, input data \(\varvec{x} \in {\mathbb {R}}^n\) is encoded into n qubits via the unitary transformation \(U_{\phi }(\varvec{x})\) made up of \(R_x\) rotation gates; each qubit encodes one feature of our input data, resulting in \({\mathcal {O}}(1)\) circuit depth.

The underlying idea in our hybrid model is to integrate the quantum components strategically, in such a way that the number of qubits used is kept limited while ensuring streamlined data processing efficiency. Specifically, considering that the classical U-Net initially processes a \(28 \times 28 \times 1\) image and progressively scales it within the network’s encoder until reaching a vertex with dimensions of \(2 \times 2 \times 40\), we introduce the quantum elements precisely at the vertex of the network.

Fig. 4 Architecture of the QResNet block, where Convolutional layers are replaced with VQCs. The QResNet takes as input x, which is the information coming from the image, and t, which is the temporal information, and finally returns h

We propose the Quantum ResNet (QResNet) layer at the vertex of the QVU-Net, as shown in Fig. 4. The QResNet is analogous to the ResNet, as it is characterized by skip connections and two processing layers, whose output is then added to the input; the difference with the classical ResNet layer is that some of the Convolutional layers used in the classical ResNet are replaced with VQCs. As we are working with images scaled down to 2 \(\times\) 2 dimensions, the only viable choice is to use a VQC instead of convolutional layers, which effectively amounts to a single filter pass over the entire image. Not all Convolutional layers of the ResNet are replaced with VQCs; rather, the replacement is done gradually. We initially analyze a hybrid architecture where the percentage of channels processed by VQCs in QResNet is set at 10%. Subsequently, we examine an architecture in which 50% of the input channels in QResNet are processed by VQCs. Finally, we explore an architecture in which all 40 channels are processed solely by VQCs instead of Convolutional layers. Gradually hybridizing the architecture allows us to better and more comprehensively analyze the impact of incorporating quantum elements into the network’s structure. To the best of our knowledge, this is the first time a quantum version of the ResNet block within a hybrid U-Net is proposed in the literature.
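To make the channel split concrete, the following is a schematic NumPy sketch of the idea, not the actual implementation: a chosen fraction of the vertex channels is routed to quantum circuits while the remaining channels follow the classical convolutional path, and the result is added back to the input through the skip connection. Both `vqc_process` and `conv_process` are hypothetical placeholders, and the mapping from the 10%/50%/100% settings to channel counts is only illustrative.

```python
import numpy as np

def vqc_process(channels):
    """Placeholder for the VQC path; tanh mimics Pauli-Z expectations in [-1, 1]."""
    return np.tanh(channels)

def conv_process(channels):
    """Placeholder for the classical convolutional path of the ResNet block."""
    return channels

def qresnet_vertex(x, quantum_fraction=0.1):
    """x has shape (2, 2, 40); route a fraction of the channels to the quantum path."""
    n_q = int(round(quantum_fraction * x.shape[-1]))
    h = np.concatenate([vqc_process(x[..., :n_q]), conv_process(x[..., n_q:])], axis=-1)
    return x + h                                   # residual (skip) connection

print(qresnet_vertex(np.random.randn(2, 2, 40), quantum_fraction=0.5).shape)   # (2, 2, 40)
```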

The choice of the ansatz in the VQCs was made considering that the maximum number of channels at the vertex is 40. Since the information in the vertex is distributed across multiple channels rather than confined to one, we want to adopt an ansatz configuration capable of efficiently capturing and leveraging the correlations among these channels. Starting from these considerations, we try two distinct ansatz structures, inspired by [29]. Their choice is driven by the fact that they work on three channels simultaneously, aiming to process not only local information related to a single channel, but especially inter-channel information.

The Hierarchical Quantum Convolutional Ansatz (HQConv) in Fig. 5 first extracts local information separately within the initial channels, followed by additional controlled gates used to encode inter-channel information. Initially, controlled gates are used to extract information within each channel. In particular, as shown in Fig. 5, the A blocks, acting on qubits belonging to the same channel, can be expressed mathematically as:

$$\begin{aligned} \left| \, q^{2}_{p}, q^2_{ p+s}\right\rangle = \left[ CR_x(\theta _{x,p}) \circ CR_z(\theta _{z,p})\right] \left| \, q^{1}_{p}, q^1_{ p+s}\right\rangle \end{aligned}$$

where the symbol \(\circ\) represents the composition of the \(CR_x\) and \(CR_z\) gates, with \(CR_z\) being applied first. Additionally, \(q^1_{p}\) is the control qubit and \(q^1_{p+s}\) is the target qubit; the subscript p indicates the pixel to which the qubit refers and ranges from 0 to 3, as the image has dimensions of \(2 \times 2\) for each channel. The symbol s represents the stride hyperparameter, i.e., the distance between the control qubit and the target qubit; the stride used in this case is equal to 1. The superscript 1 indicates the initial state of the qubit immediately after encoding, while the superscript 2 indicates the state of the qubit after the application of the considered block A. The second part of the ansatz is instead characterized by inter-channel information processing. As seen in Fig. 5, the B blocks work on qubits belonging to two different channels, and in particular we can express them in mathematical terms as:

$$\begin{aligned} \left| \, q^{3}_{0}, q^3_{4}\right\rangle = \left( \, CR_x(\theta _{x}) \circ CR_z(\theta _{z})\,\right) \left| \, q^{2}_{0}, q^2_{4}\right\rangle \end{aligned}$$

where \(q^2_0\) is the control qubit, i.e., the first qubit of the first channel considered in block B, and \(q^2_4\) is the target qubit, i.e., the first qubit of the second channel considered in block B. The superscript 3 in this case indicates the state of the qubit after the application of block B.
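A hedged PennyLane sketch of this hierarchical structure is shown below, following the textual description above: A blocks apply \(CR_z\) followed by \(CR_x\) between qubits of the same channel with stride 1, and B blocks entangle the first qubits of different channels. The exact qubit pairings, the handling of boundary qubits, and the parameter sharing are our assumptions and do not necessarily match the circuit of [29].

```python
import pennylane as qml
import numpy as np

n_channels, pixels = 3, 4                       # three 2x2 channels -> 12 qubits in total
n_qubits = n_channels * pixels
dev = qml.device("default.qubit", wires=n_qubits)

def block_A(theta_z, theta_x, channel, stride=1):
    """Intra-channel block: CR_z then CR_x between qubits of the same channel."""
    base = channel * pixels
    for p in range(pixels - stride):            # pairing without wrap-around (assumption)
        ctrl, tgt = base + p, base + p + stride
        qml.CRZ(theta_z[p], wires=[ctrl, tgt])
        qml.CRX(theta_x[p], wires=[ctrl, tgt])

def block_B(theta_z, theta_x):
    """Inter-channel block: entangles the first qubits of neighbouring channels."""
    for c in range(n_channels - 1):
        ctrl, tgt = c * pixels, (c + 1) * pixels
        qml.CRZ(theta_z[c], wires=[ctrl, tgt])
        qml.CRX(theta_x[c], wires=[ctrl, tgt])

@qml.qnode(dev)
def hqconv(x, params):
    for i in range(n_qubits):                   # angle encoding, one pixel per qubit
        qml.RX(x[i], wires=i)
    for c in range(n_channels):
        block_A(params["A_z"][c], params["A_x"][c], channel=c)
    block_B(params["B_z"], params["B_x"])
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

params = {
    "A_z": np.random.uniform(0, np.pi, (n_channels, pixels - 1)),
    "A_x": np.random.uniform(0, np.pi, (n_channels, pixels - 1)),
    "B_z": np.random.uniform(0, np.pi, n_channels - 1),
    "B_x": np.random.uniform(0, np.pi, n_channels - 1),
}
print(hqconv(np.random.uniform(0, np.pi, n_qubits), params))
```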

On the other hand, the Flat Quantum Convolutional Ansatz (FQConv) in Fig. 6 immediately incorporates both intra-channel and inter-channel information, giving its structure a flat form. The blocks C and D, as shown in Fig. 6, are characterized by the presence of gates controlled by qubits belonging to another channel. In mathematical terms, the block C can be expressed as:

$$\begin{aligned} \left| \, q^{2}_{p}, q^2_{ p+s}\right\rangle = \left( \, CR_z(\theta _{z,p}) \right) \left| \, q^{1}_{p}, q^1_{ p+s}\right\rangle \end{aligned}$$

while the block D as:

$$\begin{aligned} \left| \, q^{3}_{p}, q^3_{ p+s}\right\rangle = \left( \, CR_x(\theta _{x,p})\,\right) \left| \, q^{2}_{p}, q^2_{ p+s}\right\rangle \end{aligned}$$

where, as before, \(q_{p}\) is the control qubit and \(q_{p+s}\) is the target qubit. The stride used in this case is equal to 4. Once again, the superscript 1 indicates the initial state of the qubit after encoding, 2 the state after the application of the first block (block C), and 3 the state after the application of the second block (block D). These ansatzes operate on 3 channels at a time; since the compressed image at the vertex is \(2 \times 2\) in size, we need 4 qubits per channel, and hence the total number of qubits used in our VQCs is 12. When considering the architecture where the entire vertex is processed with quantum circuits, as there are a total of 40 channels to be processed, the first 39 are processed with the previously described ansatzes, HQConv or FQConv, while the last one is processed with the PennyLane Basic Entangling Layer, which involves \(R_x\) rotations acting on 4 qubits only. The number of layers inside each VQC is equal to 3. At the end of the circuit, a measurement carried out in the Pauli-Z basis, which has +1 and \(-1\) as eigenvalues, is performed on each qubit to extract the outcome of the quantum circuit as the expected value with respect to the Pauli-Z operator. The output of each VQC is thus restricted to the \([-1, 1]\) range.
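For comparison with the previous sketch, the flat structure can be sketched in PennyLane as follows, again under our own assumptions on the exact qubit pairings: a layer of \(CR_z\) gates with stride 4 (block C) is followed by a layer of \(CR_x\) gates on the same cross-channel pairs (block D), and every qubit is measured in the Pauli-Z basis so that each output lies in \([-1, 1]\).

```python
import pennylane as qml
import numpy as np

n_qubits, stride = 12, 4                       # three 2x2 channels; stride 4 crosses channels
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def fqconv(x, theta_z, theta_x):
    for i in range(n_qubits):                  # angle encoding of the 12 pixels
        qml.RX(x[i], wires=i)
    # Block C: inter-channel CR_z rotations, control q_p and target q_{p+stride}.
    for p in range(n_qubits - stride):         # exact pairing is our assumption
        qml.CRZ(theta_z[p], wires=[p, p + stride])
    # Block D: inter-channel CR_x rotations on the same pairs.
    for p in range(n_qubits - stride):
        qml.CRX(theta_x[p], wires=[p, p + stride])
    # Pauli-Z expectation value for every qubit, each restricted to [-1, 1].
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

theta_z = np.random.uniform(0, np.pi, n_qubits - stride)
theta_x = np.random.uniform(0, np.pi, n_qubits - stride)
print(fqconv(np.random.uniform(0, np.pi, n_qubits), theta_z, theta_x))
```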

Fig. 5 HQConv ansatz proposed in [29] initially extracts local information, then uses additional controlled gates to encode inter-channel information

Fig. 6 FQConv ansatz proposed in [29] is capable of immediately incorporating both intra-channel and inter-channel information

3.2 Quanvolutional U-Net Hybrid Architecture

In addition to the QVU-Net, where only the vertex is hybridized, we also propose a hybridization in a part of the U-Net network dedicated to feature extraction. The purpose is to assess whether quantum feature extraction can indeed bring further improvements in terms of the quality of the generated images. However, we decide not to act on the first level of the U-Net because the images still have dimensions of \(28 \times 28 \times 10\), making the efficient use of a VQC computationally challenging. Therefore, we consider the second level of the encoder, where the images are \(14 \times 14 \times 20\), as shown in Fig. 7; we call this architecture QuanvU-Net.

Fig. 7 Second proposed U-Net hybrid architecture named QuanvU-Net, where the quantum part is incorporated not only at the vertex but also at the second level of the encoder block

In order to retain angle encoding, we process the image in the second level of the encoder with an idea inspired by the Quanvolutional method [30]. We employ Quanvolutional filters that, similarly to classical Convolutional filters, process one subsection of the image at a time, until they have traversed the entire image. In doing so, they produce a feature map by transforming spatially local subsections of the input tensor. Unlike the straightforward element-wise matrix multiplication operation performed by a classical Convolutional filter, a Quanvolutional filter alters input data through the utilization of a quantum circuit, which may have a structured or random configuration. In our case, the Quanvolutional approach is employed within the ResNet block, thereby creating the QuanResNet block, as depicted in Fig. 8. Specifically, we consider only 3 channels out of the total 20 in the \(14 \times 14\) image to make the approach practically feasible, as it requires a large number of quantum circuit executions. In particular, 4 pixels are taken from each channel and processed by a 12-qubit variational circuit; the variational circuit remains the same while passing over the entire image and acts as if it were a classical Convolutional filter.
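A minimal PennyLane sketch of such a Quanvolutional pass is given below: the same 12-qubit circuit is slid over a \(3 \times 14 \times 14\) tensor, reading a \(2 \times 2\) window from each of the 3 channels at every position and emitting one Pauli-Z expectation per qubit. The simple ansatz, the window size, and the stride are illustrative assumptions, not the exact filter used in the QuanResNet block.

```python
import pennylane as qml
import numpy as np

n_qubits = 12                                     # 2x2 window over 3 channels = 12 pixels
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quanv_filter(patch, theta):
    """Fixed variational circuit applied to one flattened 2x2x3 patch (toy ansatz)."""
    for i in range(n_qubits):
        qml.RX(patch[i], wires=i)                 # angle encoding of the 12 pixels
    for i in range(n_qubits):
        qml.RY(theta[i], wires=i)
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

def quanvolution(image, theta, window=2, stride=2):
    """Slide the same circuit over the image, like a classical convolutional filter."""
    C, H, W = image.shape                         # here (3, 14, 14)
    out = np.zeros((n_qubits, H // stride, W // stride))
    for i in range(0, H - window + 1, stride):
        for j in range(0, W - window + 1, stride):
            patch = image[:, i:i + window, j:j + window].reshape(-1)
            out[:, i // stride, j // stride] = quanv_filter(patch, theta)
    return out

image = np.random.uniform(0, np.pi, (3, 14, 14))
theta = np.random.uniform(0, np.pi, n_qubits)
print(quanvolution(image, theta).shape)           # (12, 7, 7) quantum feature maps
```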

Fig. 8 Architecture of the QuanResNet block, where the Convolutional layer of the classical ResNet is replaced with a Quanvolutional filter. The QuanResNet takes as input \(\textbf{x}\), which is the information coming from the image, and \(\textbf{t}\), which is the temporal information, and finally returns \(\textbf{h}\)

As for the previous QVU-Net architecture, the idea with the QuanvU-Net is to work on 3 channels at a time to process both intra-channel and inter-channel features. Therefore, the ansatzes used in the QuanResNet block are the same as the ones used at the vertex, namely HQConv and FQConv shown in Figs. 5 and 6, respectively. Also in this case, the use of Quanvolutional filters in a ResNet block marks a novel advancement compared to the literature.

3.3 Transfer Learning Approach

In addition to standard training for both the hybrid and classical networks, we propose an approach inspired by transfer learning. In fact, both the training phase and the subsequent inference phase prove to be significantly time-consuming in the case of the hybrid QVU-Net and QuanvU-Net, becoming even longer compared to the classical U-Net as the percentage of variational circuits at the vertex increases. It is therefore worth reducing the time required for the training phase.

For this reason, we propose to adapt classical-to-quantum transfer learning [31], which proves to be one of the most appealing transfer learning approaches. As illustrated in Fig. 9, our idea is to initially train the classical U-Net for a certain number of epochs. The weights obtained in this way are then transferred to the QVU-Net, except for the weights at the vertex, and the QVU-Net is trained for a very limited number of additional epochs. In this way, it is expected that the final fine-tuning with the hybrid networks can bring a significant improvement in performance compared to using only the classical network, while still maintaining a low training time.
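In code, the transfer step reduces to copying every pretrained sub-tree of parameters except the one belonging to the vertex. The sketch below assumes Flax-style nested parameter dictionaries and a hypothetical `"vertex"` key; the helper names in the commented usage are placeholders rather than functions from our implementation.

```python
import copy

def transfer_weights(classical_params, hybrid_params, skip_key="vertex"):
    """Copy every pretrained parameter sub-tree except the vertex, which keeps its fresh init."""
    new_params = copy.deepcopy(hybrid_params)
    for name, subtree in classical_params.items():
        if skip_key not in name:                  # the vertex block is left untouched
            new_params[name] = subtree
    return new_params

# Hypothetical usage (n + m epochs, e.g. 19 + 1 for MNIST, 18 + 2 for Fashion MNIST):
# classical_params = train_classical_unet(epochs=19)
# hybrid_params    = init_hybrid_unet()
# hybrid_params    = transfer_weights(classical_params, hybrid_params)
# hybrid_params    = train_hybrid_unet(hybrid_params, epochs=1)
```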

Fig. 9 Outline of the proposed transfer learning approach. The classical model is trained for n epochs. All the weights, except those of the vertex, are then transferred to the hybrid model. The latter is further trained for additional m epochs

4 Experimental Results

In this section, we analyze the results obtained by using the different architectures proposed in the paper:

  • the 1HQConv QVU-Net and 1FQConv QVU-Net, with quantum circuits in the QResNet that act only on 10% of the channels at the vertex and use the HQConv or FQConv ansatz, respectively;

  • the 7HQConv QVU-Net and 7FQConv QVU-Net, with quantum circuits in the QResNet that act only on 50% of the channels at the vertex and use the HQConv or FQConv ansatz, respectively;

  • the FullHQConv QVU-Net and FullFQConv QVU-Net, with quantum circuits in the QResNet that act on all channels at the vertex and use the HQConv or FQConv ansatz, respectively;

  • the QuanvU-Net, with the vertex hybridized at 10% and the QuanResNet at the second level of the encoder.

4.1 Experimental Settings

The implementation is carried out in Python 3.8 using PennyLane and Flax. PennyLane is a framework enabling local quantum circuit simulation and integration with classical neural networks, whereas Flax is an open-source neural network library built on JAX, providing a flexible and efficient platform for hybrid neural network execution via compilation. We use PennyLane for the implementation of the quantum circuits, while the hybrid networks and the entire training process are implemented in Flax; the classical U-Net is implemented in Flax as well.

Regarding the experiments, the L2 loss is used with the P2 weighting [32]. We use an exponential moving average (EMA) over model parameters with a rate that depends on the training step, and the Adam optimizer [33] with a learning rate of \(10^{-3}\), \(\beta _1\) of 0.9, and \(\beta _2\) of 0.99. The training process consists of a total of 20 epochs.

We use two benchmark datasets, namely MNIST [34] and Fashion MNIST [35]. Both of them contain grayscale images belonging to 10 different classes, with a total of 60k training samples. The metrics used for evaluation are the Fréchet Inception Distance (FID) [17], the Kernel Inception Distance (KID) [36], and the Inception Score (IS) [37], assessed on 7000 generated images. We utilize the TorchMetrics library [38], replicating the single channel of the generated images three times to make the dimensions compatible with those required by the InceptionV3 network backbone. For the KID calculation, the subset size for computing mean and variance is set to 100, while for the IS calculation the dataset is divided into 10 splits for mean and variance computation.
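As an indication of how this evaluation can be set up with TorchMetrics, the sketch below replicates the grayscale channel three times and computes the three metrics on random toy tensors (a small batch instead of the 7000 images used in our experiments); it is an illustrative snippet, not our evaluation script.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Toy stand-ins for real and generated images: uint8 grayscale replicated to 3 channels.
real = torch.randint(0, 256, (200, 1, 28, 28), dtype=torch.uint8).repeat(1, 3, 1, 1)
fake = torch.randint(0, 256, (200, 1, 28, 28), dtype=torch.uint8).repeat(1, 3, 1, 1)

fid = FrechetInceptionDistance()
kid = KernelInceptionDistance(subset_size=100)     # subset size used for KID mean/std
inception = InceptionScore(splits=10)              # 10 splits for IS mean/std

fid.update(real, real=True);  fid.update(fake, real=False)
kid.update(real, real=True);  kid.update(fake, real=False)
inception.update(fake)

print(fid.compute())          # FID
print(kid.compute())          # (KID mean, KID std)
print(inception.compute())    # (IS mean, IS std)
```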

A machine equipped with an AMD Ryzen 7 5800X 8-Core CPU at 3.80 GHz and with 64 GB of RAM is used for the experiments.

4.2 Fashion MNIST Dataset

Let us initially consider the images generated for Fashion MNIST. As there are no significant differences between the use of HQConv and FQConv, we specifically examine the results obtained by HQConv with three layers. As seen in Fig. 10, at the first epoch hybrid networks demonstrate better performance compared to the classical one, achieving a slightly lower FID and a higher IS value. Indeed, the images generated by the FullHQConv QVU-Net achieve an FID of 295.7863 and an IS of \(1.4454 \pm 0.0057\), while those obtained from the classical architecture have a higher FID of 296.869 and a lower IS of \(1.4164 \pm 0.0122\). This is in line with our expectations, as quantum models are generally more adept at feature extraction and processing than classical models when only a few training epochs are available. By the tenth epoch, all hybrid networks show significantly better values in terms of FID and KID, with the 1HQConv QVU-Net architecture in Fig. 10f having an FID of 52.5332, which is more than seven points lower than the classical network in Fig. 10e, whose FID is 60.1476. Therefore, the 1HQConv QVU-Net architecture is able to achieve an FID that is nearly 13% better than the classical architecture. The IS of hybrid networks at the tenth epoch is comparable to that achieved by the classical network. However, at the twentieth epoch, a gradual deterioration in the performance of hybrid networks in Figs. 10j–l is observed, progressively worsening with an increase in the level of hybridization. The results are still comparable to those obtained by the classical network, as the classical architecture generates images with an FID of 39.4563, while the worst-performing quantum architecture in this case, the FullHQConv QVU-Net, has an FID of 41.3882. The most important aspect, however, is that there is a significant reduction in parameters as the percentage of vertex hybridization increases: indeed, the FullHQConv QVU-Net has more than 11% fewer parameters than the classical architecture.

Fig. 10 Fashion MNIST dataset results using the first hybrid architecture. The figure shows in the first column the images generated by the classical U-Net network, while in the second column the images generated by the 1HQConv QVU-Net, in the third column by the 7HQConv QVU-Net, and in the last column the images generated by the FullHQConv QVU-Net. The row-wise division considers in the first row the images generated after the networks are trained for just one epoch, the second row after training for ten epochs, and the third row after the complete training of twenty epochs

Considering the images generated by the second possible hybridization of the U-Net, the QuanvU-Net architecture, which involves not only the hybridized vertex but also the use of Quanvolutional filters on outer layers of the U-Net, it can be observed in Fig. 11 that this yields better performance. Considering the case with only 10% of the vertex hybridized, which leads to more satisfactory performance, we note that if only the vertex is hybridized, at the first epoch the metrics are more or less similar to those of the classical network. However, if we also consider the QuanvU-Net from the first epoch, as shown in Fig. 11c, the performance is greatly improved, with an FID lower than that of the classical network by about 10 points: the images generated by the QuanvU-Net achieve an FID of 285.1551, in contrast to the classical architecture, which has an FID of 296.869. By the tenth epoch, there are no significant differences between the two possible implementations. More interesting is the case of the last epoch (Fig. 11i), where the introduction of quantum layers at the vertex alone had not brought any improvement. Now, however, better performance in terms of FID and KID is achieved, as the images of the QuanvU-Net reach an FID of 38.8 and a KID of \(0.0269 \pm 0.0007\), which is better than the FID of 39.4563 and KID of \(0.0275 \pm 0.0008\) obtained from the classical architecture. This is in line with our expectations, given that we have now incorporated a quantum component into a much more critical area of the U-Net, which not only processes but, more importantly, extracts features.

Fig. 11 Fashion MNIST dataset results using the second hybrid architecture. The figure shows in the first column the images generated by the classical U-Net network, while in the second column the images generated by the 1HQConv QVU-Net, in the third column by the QuanvU-Net. The row-wise division considers in the first row the images generated after the networks are trained for just one epoch, the second row after training for 10 epochs, and the third row after the complete training of 20 epochs

4.3 MNIST Dataset

We then consider the MNIST dataset, whose results are shown in Fig. 12. In this case, we report the results obtained from the first hybridization, the QVU-Net, with the HQConv ansatz implemented with three layers only, as the results obtained using FQConv, or the second hybridization, the QuanvU-Net, which involves quantum layers outside the vertex, are very similar. At the first epoch, hybrid networks have significantly better results than the classical one in terms of all metrics: the FullHQConv QVU-Net, which is the best-performing hybrid architecture in this case, achieves an FID of 299.7292, a KID of \(0.3974 \pm 0.0028\), and an IS of \(1.5930 \pm 0.0239\). These metrics are better than those obtained from the classical architecture, which has an FID of 311.0582, a KID of \(0.4144 \pm 0.0029\), and an IS of \(1.4936 \pm 0.0142\). Similarly to what we observed with the Fashion MNIST dataset, this once again demonstrates the ability of quantum models to perform better than classical models when training epochs are limited. By the tenth epoch, the advantage diminishes somewhat, except for the 7HQConv QVU-Net architecture shown in Fig. 12g, which still shows better values in terms of FID, KID, and IS. At the last epoch, once again the 7HQConv QVU-Net architecture shows better results, as shown in Fig. 12k, obtaining an FID of 43.0731, a KID of \(0.0523 \pm 0.0015\), and an IS of \(1.9256 \pm 0.0205\), with an FID therefore better by more than 2 points compared to that of the classical architecture, which has an FID of 45.2762, a KID of \(0.0558 \pm 0.0014\), and an IS of \(1.9152 \pm 0.035\). We note that all hybrid networks have a lower number of parameters than the classical one.

Fig. 12 MNIST dataset results using the first hybrid architecture. The figure shows in the first column the images generated by the classical network, while in the second column the images generated by the 1HQConv QVU-Net, in the third column by the 7HQConv QVU-Net, and in the last column the images generated by the FullHQConv QVU-Net. The row-wise division considers in the first row the images generated after the networks are trained for just one epoch, the second row after training for 10 epochs, and the third row after the complete training of 20 epochs

4.4 Transfer Learning

We can now analyze the results obtained with the transfer learning approach. For the MNIST dataset, we compare three settings: (i) a classical U-Net trained for all 20 epochs; (ii) a classical network trained for 19 epochs, whose weights, except for those at the vertex, are transferred to another classical network that is then trained for one additional epoch; and (iii) a classical network trained for 19 epochs, followed by transfer learning to the hybrid 1FQConv QVU-Net, which is retrained for a single epoch. The results obtained are shown in Fig. 13, demonstrating how the transfer learning approach from classical to hybrid works very well.

What is noticeable is that, with this approach, images are obtained with an FID of 41.6646, a KID of \(0.0493 \pm 0.0013\), and an IS of \(1.9556 \pm 0.0335\), which are significantly better even than those of the best-performing hybrid architecture previously presented, the results of which are shown in Fig. 12k. Thus, by using the transfer learning approach, the results obtained from the classical model are significantly improved, with an approximately 8% improvement on FID and 12% on KID for the MNIST dataset.

Fig. 13 Transfer learning results on the MNIST dataset where generated images are: a classical architecture; b transfer learning (19 + 1) classical-classical; c transfer learning (19 + 1) classical-1FQConv QVU-Net

Similarly, the same approach was taken for the Fashion MNIST dataset, with the results reported in Fig. 14. Instead of having 19 training epochs on the classical network and just one on the network to which the weights have been transferred, we now have 18 epochs on the classical network and 2 on the network to which the weights have been transferred. Also in this case, better results are obtained compared to the end-to-end training of previously proposed hybrid architectures. Indeed, the images shown in Fig. 14c are obtained with FID of 38.6835, KID of \(0.0261 \pm 0.0007\), IS of \(4.0527 \pm 0.0890\), which are superior to those of the best-performing hybrid architecture previously presented, the results of which are shown in Fig. 11i.

Fig. 14 Transfer learning results on the Fashion MNIST dataset where generated images are: a classical architecture; b transfer learning (18 + 2) classical-classical; c transfer learning (18 + 2) classical-1FQConv QVU-Net

Finally, all the numerical results obtained in the previous experiments are summarized in Tables 1 and 2 for Fashion MNIST and MNIST datasets, respectively, where bold values represent the best ones obtained according to each metric.

Table 1 Numerical results for the images generated from the Fashion MNIST dataset
Table 2 Numerical results for the images generated from the MNIST dataset

5 Conclusions

The use of quantum computing in generative machine learning models can bring numerous advantages, both in terms of performance and in terms of reducing the parameters to be trained. In this paper, we proposed an efficient integration of quantum computing in diffusion models, presenting for the first time two possible hybrid U-Nets. The first, the QVU-Net, involves replacing the convolutional layers that form the ResNet with variational circuits only at the vertex, while the second, the QuanvU-Net, involves the replacement in the second block of the encoder part as well, leveraging in this case an approach inspired by Quanvolutional filters. We also exploited an approach inspired by transfer learning to reduce the overall training time compared to the complete training of a hybrid architecture.

The obtained results confirm the real advantage of using quantum layers in extremely complex networks, such as the U-Net of DMs. The approach of hybridizing the U-Net confirms that integrating variational circuits into a classical network can yield concrete benefits. Through numerous tests, we showed that quantum layers allow for further enhancement of network performance. For MNIST, the use of the 7HQConv QVU-Net yielded the best performance. Furthermore, the use of quantum layers led to better performance not only at the twentieth epoch: in general, all hybrid networks show significantly better metric values than the classical network they are compared against, starting from the first epoch.

On Fashion MNIST, the first possible hybridization of the U-Net, the QVU-Net, which involves only the hybridized vertex, fails to yield better results than the classical one, despite having a significantly lower number of parameters than the classical network. However, the quantum advantage in this case lies in faster learning: when we analyze the results at the tenth epoch, all hybrid networks still demonstrate metric values much more advantageous than those observed for the classical network. It is with the second implementation, the QuanvU-Net, which also involves the use of Quanvolutional filters in more expressive layers, that better results are achieved. This demonstrates that introducing quantum layers in areas dedicated to feature extraction is more effective than introducing them only at the vertex, which primarily processes features already extracted earlier.

Finally, the idea behind using transfer learning between a classical network and a hybrid network is driven by the desire to keep simulation times limited while still achieving better performance than the classical network. This is observed both in the case of MNIST and Fashion MNIST, where we indeed obtain the best results. It is essential to note that the main goal of this paper is to demonstrate that hybrid quantum networks can outperform, or perform on par with, a classical network that has more parameters. It is worth emphasizing that all hybrid networks have significantly fewer parameters.

The possible future developments of this work involve expanding hybridization to other parts of the U-Net, replacing convolutional layers with variational circuits even in regions where the image is less downscaled. This aims to achieve further reduction in the number of trainable parameters, in addition to the potential for improved performance. Furthermore, the goal is to extend the testing to more complex datasets beyond MNIST and Fashion MNIST.