1 Introduction

Deep learning (DL)-based accelerated magnetic resonance (MR) image reconstruction is currently an active area of research. Many model architectures have been proposed, such as networks which learn end-to-end transformations [4], networks which “unroll” a traditional optimisation algorithm into a deep network [7, 9], optimisation algorithms incorporating deep priors [11] and, more recently, networks incorporating adversarial losses [12]. Finding the optimal network architecture remains an open problem. Despite these advances in architecture design, however, only marginal progress has been made in understanding the behaviour of reconstruction networks. In [13], the expressivity a network needs to achieve perfect reconstruction is characterised using the connection between convolutional neural networks (CNNs) and convolution framelets. The authors of [6] empirically assessed the generalisability of a variational inference network and found that its performance is sensitive to the signal-to-noise ratio (SNR) of the data. Nevertheless, no theory yet exists that can explain the worst-case behaviour of these networks. In [1], it is shown that methods with adversarial losses can bias the reconstruction, with the risk of hallucination. Even though DL methods have been shown to be effective, a good grasp of how they produce errors is crucial for the reliable deployment of these methods in clinical settings.

While bridging the gap in our theoretical understanding of DL-based reconstruction remains challenging, the Bayesian deep learning literature suggests that the uncertainty associated with network outputs can be modelled directly using practical regularisation techniques [2]: namely, MC-dropout and the heteroscedastic loss [5], which capture model uncertainty and data uncertainty respectively. Although such techniques have been applied to MR image quality transfer/super-resolution (SR) tasks [10], they have yet to be investigated in the general MR image reconstruction setting. In this work, we apply them to two network architectures, UNET [4] and a deep cascade of CNNs (DC-CNN) [9]. We show that the Bayesian DL methods can approximately characterise the confidence associated with the generated reconstructions. However, we also point out that the proposed formulation appears too simplistic to model the “true” uncertainty associated with the MR reconstruction problem in general; more sophisticated approaches may be necessary before such uncertainty maps can be leveraged in practical scenarios.

2 Methods

Problem Formulation: Let \(\mathbf {x}\in \mathbb {C}^n\) be a fully-sampled image and \(\mathbf {y}\in \mathbb {C}^m\) be the undersampled data obtained as \(\mathbf {y}= \mathcal {F}_u \mathbf {x}+ \epsilon \), where \(\mathcal {F}_u\) is an undersampling Fourier operator and \(\epsilon \sim \mathcal {N}(0, \sigma ^2 \mathbf {I})\). The goal is to learn the inversion \(p(\mathbf {x}|\mathbf {y})\), or \(p(\mathbf {x}| \mathbf {x}_u)\) where \(\mathbf {x}_u= \mathcal {F}^H_u \mathbf {y}\) is the zero-filled reconstruction, which is aliased. This is typically approached by a maximum a posteriori (MAP) estimate \(\arg \max _{\mathbf {x}} p(\mathbf {x}|\mathbf {y}) = \arg \min _{\mathbf {x}} - \log p(\mathbf {y}|\mathbf {x}) - \log p(\mathbf {x})\), which can be solved as an equivalent minimisation problem wherein the likelihood and the prior terms correspond to data fidelity and regularisation terms, respectively. For example, compressed sensing can be seen as MAP inference with a sparsity-inducing prior. Many deep learning algorithms can be seen as approximations to such MAP inference [3, 7, 9]: they learn an inversion function \(f^{\mathbf {w}}(\mathbf {x}_u) \approx \mathbf {x}\), where the network parameters \(\mathbf {w}\) are learnt from the dataset \(\mathcal {D} = (\mathbf {Y}, \mathbf {X}) = (\{ \mathbf {y}_1, \dots , \mathbf {y}_{|\mathcal {D}|} \},\{ \mathbf {x}_1, \dots , \mathbf {x}_{|\mathcal {D}|} \})\). The problem with MAP inference is that it provides only a point estimate. In the case of compressed sensing, there is a theoretical framework relating the number of measurements, the sparsity level and the reconstruction error. For deep learning, no such theoretical guarantees exist yet, and it is unknown when a network will fail to reconstruct an image. It is therefore desirable to model the distribution \(p(\mathbf {x}| \mathbf {y})\) instead, which can provide the variance associated with the output.
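As a concrete illustration of this forward model, below is a minimal NumPy sketch of the simulated acquisition and the zero-filled reconstruction for a single 2D image. The function names and the implementation are ours for illustration, not the authors' code.

```python
import numpy as np

def forward_model(x, mask, sigma=0.01):
    """Simulate y = F_u x + eps for a 2D complex image x and a binary
    k-space mask (both of shape (nx, ny))."""
    k = np.fft.fft2(x, norm="ortho")                    # F x
    eps = sigma * (np.random.randn(*k.shape)
                   + 1j * np.random.randn(*k.shape))    # complex Gaussian noise
    return mask * (k + eps)                             # keep sampled locations only

def zero_filled(y, mask):
    """x_u = F_u^H y: inverse FFT of the masked k-space; aliased when undersampled."""
    return np.fft.ifft2(mask * y, norm="ortho")
```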

Bayesian Deep Learning: In the Bayesian formulation, given a new undersampled image \(\mathbf {x}_u\) and a dataset \(\mathcal {D}\), a predictive distribution for the reconstructed image \(\mathbf {x}\) is obtained by \(p(\mathbf {x}| \mathbf {x}_u, \mathcal {D}) = \int p(\mathbf {x}| \mathbf {x}_u, \mathbf {w}) p(\mathbf {w}| \mathcal {D}) \,\mathrm {d}\mathbf {w}\). In practice, the posterior \(p(\mathbf {w}| \mathcal {D})\) is intractable and is often approximated by a distribution \(q(\mathbf {w})\) (variational inference). In addition, the predictive distribution above is typically estimated via Monte Carlo integration unless an analytical solution exists.

Two types of uncertainty can be identified in general. The first is aleatoric (data) uncertainty: irreducible uncertainty observed in the data. For the MR image reconstruction problem, besides measurement noise, there is an inherently high level of ambiguity as to whether a pixel value represents an aliasing pattern, anatomy or texture. Whenever the network encounters unseen pathological examples, the model should exhibit a higher level of uncertainty for such regions in the reconstruction. The second is epistemic (model) uncertainty: given a dataset \(\mathcal {D}\), there are many plausible network parameters \(\mathbf {w}\) that can reconstruct the data well. This uncertainty can be reduced by increasing the size of \(\mathcal {D}\); however, in the medical imaging domain it is often difficult to collect large training datasets, making it all the more important to account for the variability in network output caused by this uncertainty.

The two types of uncertainty can be modelled by incorporating the heteroscedastic loss and MC-dropout, respectively. Here we only summarise the methods; detailed derivations can be found in [2, 5]. Firstly, we set the likelihood to \(p(\mathbf {x}|\mathbf {x}_u, \mathbf {w}) = \mathcal {N}(\mathbf {x}| f^{\mathbf {w}}(\mathbf {x}_u), g^{\mathbf {w}}(\mathbf {x}_u))\), where \(f^{\mathbf {w}}(\mathbf {x}_u)\) models the mean prediction and \(g^{\mathbf {w}}(\mathbf {x}_u)\) accounts for the uncertainty in the input by estimating the covariance of the prediction. For simplicity, the covariance matrix is assumed to be diagonal (i.e. we only model the pixel-wise variance). The two networks are trained by minimising the heteroscedastic loss:

$$\begin{aligned} \mathcal {L}_{\mathrm {Het.}}(\mathbf {w}) = \frac{1}{N|\mathcal {D}|}\sum _{(\mathbf {x}_u, \mathbf {x}) \in \mathcal {D}} \sum _{i=1}^N \frac{1}{2g^{\mathbf {w}}_i(\mathbf {x}_u)} \Vert \mathbf {x}_i - f_i^{\mathbf {w}}(\mathbf {x}_u) \Vert ^2 + \frac{1}{2} \log g_i^{\mathbf {w}} (\mathbf {x}_u), \end{aligned}$$
(1)

i.e. the pixel-wise error is weighted by the predicted inverse pixel variance. Epistemic uncertainty can be modelled using MC-dropout, which simply keeps dropout active on the network's activation maps at test time. The predictive mean is given by \(\mathbb {E} [\mathbf {x}] \approx \frac{1}{T} \sum _{t=1}^T f^{\mathbf {w}_t}(\mathbf {x}_u)\), where \(\mathbf {w}_t\) denotes the network configuration after the \(t\)-th dropout sample has been applied. The predictive variance is given by \(\mathbb {V}[\mathbf {x}] \approx \frac{1}{T} \sum _{t=1}^T g^{{\mathbf {w}}_t}(\mathbf {x}_u) + \frac{1}{T} \sum _{t=1}^T (f^{\mathbf {w}_t}(\mathbf {x}_u))^2 - \big ( \frac{1}{T} \sum _{t=1}^T f^{\mathbf {w}_t}(\mathbf {x}_u) \big )^2\), where the first term corresponds to aleatoric uncertainty and the last two terms to epistemic uncertainty. In addition, the variance of each complex-valued pixel is given by the sum of the variances of its real and imaginary components: \(\mathbb {V}[\mathbf {z}] = \mathbb {V}[\mathfrak {R}(\mathbf {z})] + \mathbb {V}[\mathfrak {I}(\mathbf {z})]\).
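To make Eq. (1) and the MC-dropout estimates above concrete, here is a minimal PyTorch sketch. It assumes a hypothetical two-headed `model` returning \((f, \log g)\); predicting \(\log g\) rather than \(g\) is a standard numerical-stability trick [5], and the remaining names are illustrative.

```python
import torch

def heteroscedastic_loss(x, f_pred, log_g_pred):
    """Eq. (1): squared error weighted by the predicted inverse pixel variance.
    Predicting log g keeps g positive and avoids division by zero."""
    inv_g = torch.exp(-log_g_pred)
    return (0.5 * inv_g * (x - f_pred) ** 2 + 0.5 * log_g_pred).mean()

@torch.no_grad()
def mc_dropout_predict(model, x_u, T=20):
    """Predictive mean and variance with dropout kept active at test time.
    For complex images stored as real/imag channels, sum the channel variances."""
    model.train()                          # keep dropout layers stochastic
    fs, gs = [], []
    for _ in range(T):
        f, log_g = model(x_u)              # assumed two-headed network
        fs.append(f)
        gs.append(torch.exp(log_g))
    fs, gs = torch.stack(fs), torch.stack(gs)
    mean = fs.mean(0)                      # (1/T) sum_t f^{w_t}
    aleatoric = gs.mean(0)                 # (1/T) sum_t g^{w_t}
    epistemic = (fs ** 2).mean(0) - mean ** 2
    return mean, aleatoric + epistemic
```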

Fig. 1. The proposed network architectures. For DC-CNN1, f and g are correlated, whereas for DC-CNN2, f and g are conditionally independent. For UNET, f and g share the same encoder but have two separate decoding paths.

Network Architectures: We consider UNET [4] and DC-CNN [9] as the base architectures for \(f^{\mathbf {w}}(\mathbf {x}_u)\). Note that the design of \(g^{\mathbf {w}}(\mathbf {x}_u)\) is flexible: one can parametrise f and g independently, or consider a single network with multiple heads. In the former case, \(g^{\mathbf {w}}(\mathbf {x}_u)\) models intrinsic data uncertainty [10], whereas in the latter case the uncertainty is correlated with the mean prediction. For DC-CNN, we consider both variants: DC-CNN1 aggregates the penultimate feature maps from each sub-network and feeds them into a 5-layer variance network, whereas DC-CNN2 trains an independent 5-layer variance network directly on the undersampled image. For UNET, \(f^{\mathbf {w}}(\mathbf {x}_u)\) and \(g^{\mathbf {w}}(\mathbf {x}_u)\) share the same encoder but have independent decoders. See Fig. 1 for more details.
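As an illustration of the multi-head variant, below is a minimal PyTorch sketch of a shared encoder with separate heads for the mean f and the log-variance g. The layer counts and channel sizes are placeholders, not the exact UNET configuration used here; complex images are represented as two real channels.

```python
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Shared encoder; separate heads predict the mean f and log-variance g."""
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
            nn.Dropout(p=0.2),                        # element-wise MC-dropout
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.f_head = nn.Conv2d(ch, 2, 3, padding=1)  # mean (real/imag channels)
        self.g_head = nn.Conv2d(ch, 2, 3, padding=1)  # log-variance per pixel

    def forward(self, x_u):
        h = self.encoder(x_u)
        return self.f_head(h), self.g_head(h)
```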

3 Experiments and Results

Dataset: Two datasets were considered for the experiments. Dataset A consists of 5000 short-axis cardiac cine MR images from the UK Biobank study [8], acquired using a bSSFP sequence with matrix size 208 \(\times \) 187, 50 frames and a voxel resolution of 1.8 \(\times \) 1.8 \(\times \) 10.0 mm\(^3\). Since only the magnitude images were available, we simulated the phase components using slowly varying sinusoidal waves. Dataset B consists of 10 fully sampled short-axis cardiac cine MR scans acquired at St. Thomas Hospital, UK, using a bSSFP sequence with 32 channels, matrix size 192 \(\times \) 190, 30 frames, 320 \(\times \) 320 mm FOV and 10 mm slice thickness. The multi-coil data was recombined into a single complex-valued image using SENSE, which was then treated as the ground-truth image. Both datasets were cropped to 192 \(\times \) 192.

Experiment Setup: In this work, we investigate the following questions: (1) How do the Bayesian networks perform compared to the standard networks? (2) What do the generated uncertainty maps look like for (a) the same dataset and undersampling scheme but different acceleration factors, (b) the same dataset but a different undersampling scheme, and (c) a different dataset? To answer these questions, the following experimental setting was considered. Dataset A was split into 4000 training subjects and 1000 test subjects. For training, we used 1D Cartesian undersampling, where each line was sampled according to a zero-mean Gaussian distribution. The undersampling masks were generated on-the-fly during training, and the acceleration factor was chosen randomly from \(n_{\mathrm {acc}} \in [1, 5]\). Note that the networks perform better when fine-tuned on a single fixed acceleration factor; however, this setup sufficed for this work, as we were only interested in the relative performance of the Bayesian formulations. For testing, we used the 1000 test subjects from Dataset A and all subjects from Dataset B. In addition, three different undersampling patterns were considered: 1D Cartesian, radial and low-resolution undersampling (SR). We used golden-angle sampling for radial undersampling; for SR, the lowest frequencies were acquired until the desired acceleration factor was met. The results were evaluated using peak signal-to-noise ratio (PSNR).
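For concreteness, here is a sketch of how such variable-density 1D Cartesian masks might be generated on-the-fly. The Gaussian line-selection density and the fully-sampled-centre heuristic are our assumptions for illustration, not necessarily the exact scheme used.

```python
import numpy as np

def cartesian_mask(nx=192, ny=192, acc=4.0, centre_lines=8):
    """1D Cartesian mask: choose ~ny/acc phase-encode lines with zero-mean
    Gaussian weighting over line position (fftshift-ed k-space coordinates).
    A few central lines are always kept -- an assumed, common heuristic."""
    n_lines = max(int(ny / acc), centre_lines)
    ky = np.arange(ny) - ny // 2
    p = np.exp(-0.5 * (ky / (ny / 6.0)) ** 2)   # Gaussian density over lines
    p /= p.sum()
    chosen = np.random.choice(ny, size=n_lines, replace=False, p=p)
    mask = np.zeros(ny, dtype=bool)
    mask[chosen] = True
    mask[ny // 2 - centre_lines // 2: ny // 2 + centre_lines // 2] = True
    return np.tile(mask, (nx, 1))               # same lines for every readout row

# Example: a mask for a random acceleration factor in [1, 5], as during training
mask = cartesian_mask(acc=np.random.uniform(1, 5))
```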

Model Parameters: For each network proposed above, we considered the following variants for an ablation study: (1) the plain network, (2) MC-dropout only (+D), (3) heteroscedastic loss only (+H), and (4) both (+D+H). For (1) and (2), the usual mean squared error (MSE) loss was used instead. Element-wise dropout with \(p = 0.2\) was applied to every feature map except the last layers of the UNET and the DC-CNN sub-networks. For training, we used Adam with \(\alpha =10^{-4}, \beta _1=0.9,\beta _2=0.999\), where \(\alpha \) was reduced by a factor of 0.1 every 100 epochs. Each network was trained for 300 epochs, using He initialisation, a weight decay of \(\lambda =10^{-6}\) and a mini-batch size of 8. For data augmentation, affine transformations were applied on-the-fly, with parameters sampled from rotations of up to \(360 ^\circ \), shifts of up to \(\pm 20\) pixels and scaling factors \(s \in [0.9, 1.3]\). For MC-dropout, we used \(T=20\) samples, as we empirically found that the results plateaued rapidly beyond that. We used PyTorch for our implementations.
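These settings map directly onto PyTorch. Below is a minimal training-loop sketch reusing the hypothetical `TwoHeadedNet` and `heteroscedastic_loss` from the earlier sketches; the dummy `loader` stands in for the real on-the-fly augmented, undersampled DataLoader.

```python
import torch

model = TwoHeadedNet()                      # hypothetical network from above
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-6)
# alpha reduced by a factor of 0.1 every 100 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=100, gamma=0.1)

# Dummy stand-in for the real augmented/undersampled DataLoader
loader = [(torch.randn(8, 2, 192, 192), torch.randn(8, 2, 192, 192))]

for epoch in range(300):
    for x_u, x in loader:                   # (zero-filled input, ground truth)
        optimiser.zero_grad()
        f, log_g = model(x_u)
        loss = heteroscedastic_loss(x, f, log_g)
        loss.backward()
        optimiser.step()
    scheduler.step()
```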

Fig. 2. Quantitative error of all models trained on Dataset A with Cartesian undersampling at acceleration factors up to 5, tested on different combinations of dataset, undersampling scheme and acceleration factor.

Results: In Fig. 2, the quantitative results are summarised for each dataset and undersampling scheme over a range of acceleration factors (AF). For Dataset A with Cartesian undersampling (the closest to the training distribution), the Bayesian methods performed worse than the baseline networks. Interestingly, however, the performance gap narrowed when the experiment was repeated with different undersampling schemes or with Dataset B. For DC-CNN, the plain network achieved the highest PSNR and DC-CNN1+D+H performed worst. This might be because the variance estimates are too noisy during training due to dropout, or because training becomes stuck in a suboptimal local minimum as a result of f and g competing. For UNET, the Bayesian models consistently outperformed the baseline on Dataset B. This suggests that the proposed Bayesian formulation could alleviate overfitting to a specific distribution when the model is over-parametrised.

Fig. 3. Visualisation of the generated uncertainty maps for different undersampling patterns and datasets: (top left) Dataset A, Cartesian, AF = 3; (top right) Dataset A, radial, AF = 8; (bottom left) Dataset A, SR, AF = 3; (bottom right) Dataset B, Cartesian, AF = 3. Note that the error and uncertainty maps are normalised across the figure.

Fig. 4. Visualisation of epistemic and aleatoric uncertainty generated by DC-CNN1 and DC-CNN2, overlaid on the ground-truth image.

The epistemic and aleatoric uncertainty maps generated for UNET+D+H, DC-CNN1+D+H and DC-CNN2+D+H are displayed in Fig. 3. We see that, in terms of scale, there is a rough correlation between the error map and the epistemic uncertainty map. However, when each uncertainty map is inspected in detail, it does not necessarily highlight the regions with the highest error. The aleatoric uncertainty tends to highlight image borders, but these do not consistently correspond to the most aliased regions of the undersampled input. In Fig. 4, we compare the epistemic and aleatoric uncertainty maps generated by DC-CNN1+D+H and DC-CNN2+D+H on Dataset A with Cartesian undersampling for different AFs. For DC-CNN1, the uncertainty level increased with the AF. For DC-CNN2, we observe a higher level of uncertainty across the whole image, with edges highlighted more dominantly. We note that these observations are consistent with the literature [10].

4 Discussion and Conclusion

In this work, we evaluated MC-dropout and the heteroscedastic loss for the MR image reconstruction problem. We observed that the Bayesian methods performed competitively when the data was further from the training distribution, and that the generated epistemic and aleatoric uncertainty maps showed a correlation with the error maps. However, we note that the current form of modelling poses several limitations. Firstly, the characteristics of the aleatoric uncertainty depend heavily on whether f and g are correlated, an architectural decision that must be made by the user based on task-oriented goals. Secondly, there were noticeable fine-scale discrepancies between the generated error and uncertainty maps. For the epistemic uncertainty maps, we speculate that this is because MC-dropout is a simple technique that cannot capture the full model uncertainty of the reconstruction networks. For the aleatoric uncertainty maps, it is presumably because the proposed methods model only pixel-wise uncertainty (i.e. only the diagonal entries of the covariance matrix). We hypothesise that this simplification is a poor approximation for modelling the variance in MR reconstruction, as aliasing caused by random undersampling is distributed across the entire image; better modelling of the covariance matrix is therefore likely to improve the results. Finally, albeit parallelisable, obtaining epistemic uncertainty requires T forward passes, which may be problematic for real-time applications. Nevertheless, we believe that Bayesian deep learning has great scope for improvement and is a crucial step towards better characterisation of deep reconstruction networks.