1 Introduction and Related Work

Multimodal imaging systems [14, 19] incorporating positron emission tomography (PET) and magnetic resonance imaging (MRI) acquire both functional and anatomical information to improve clinical diagnosis, therapy, and scientific studies of the human body (e.g., in neurology and oncology). PET and MRI images have distinctive appearances. The intrinsic spatial resolution of radioactivity in PET, typically 4–6 mm [1], is far lower than the anatomical resolution in MRI, typically 1–1.5 mm. Spatial resolution in PET is limited by positron range, noncollinearity of the annihilation photons, scatter inside the scintillation crystals, finite crystal dimension, interaction depth, etc. In PET-MRI, PET reconstruction is challenged by the stochasticity in gamma-ray emission and by the limited reliability of attenuation-map estimation without computed tomography (CT) images; thus, it typically relies on statistical prior models on the MRI and PET images.

After expectation-maximization (EM) based maximum-likelihood (ML) estimation for PET reconstruction [15], edge-preserving smoothness priors and other gradient-based penalties were incorporated within the EM framework [7]. The method in [4] relies on hierarchical image segmentation, while that in [11] reweights the voxel-wise fidelity using Fisher information. Some methods use edge information from the associated anatomical image through edge locations [5, 6, 8, 9] or mutual information of voxel intensities [16]. However, all these priors [5, 6, 7, 8, 9, 11, 16] use very local neighborhoods to enforce piecewise-smooth images and cannot capture larger-scale inter-image (structure-function) correlations or intra-image texture patterns in PET images and associated MRI images. In contrast, we propose a framework to model regularity within, and statistical dependencies across, spatially-corresponding patches in PET and MRI.

For dynamic PET, recent methods [12, 13] use hidden Markov random field (MRF) label priors with a Gaussian mixture model for intensities. Some methods [20] use spatially varying smoothing based on kernel similarity of time-activity curves, and others [10] use a Bowsher prior to capture edge information from the anatomical image. However, these methods [10, 12, 13, 20] do not capture spatial dependencies across PET and MRI. So, we propose a joint patch-based dictionary as an MRF prior on the PET-MRI image pair for Bayesian PET reconstruction. Early works on patch-based (multimodal) medical image denoising appear in [2, 3].

Very recent works [17, 18] on PET reconstruction use a patch dictionary learned from MRI images, but this strategy risks over- or underfitting the functional features of PET images, because the dictionary encodes only the anatomical features of MRI. In contrast, we propose a joint generative model, based on joint patch regularity, for the pair of PET and MRI images and leverage the joint model to reconstruct PET images. Our model subsumes learning PET image patch statistics by jointly learning statistics of PET image patches and spatially-corresponding MRI patches.

We propose a novel joint generative model for the pair of images of PET radiotracer activity and MRI magnitude. We rely on a joint patch-based sparse dictionary model formulated as an MRF prior for Bayesian PET reconstruction using EM. While our model learns the fine-scale and larger-scale regularity in patches of PET and MRI images individually, it also learns correlations between spatially-corresponding patches in MRI and PET images. Reconstructions on simulated and in vivo PET-MRI show that our method produces (i) qualitatively better regularity and (ii) quantitatively lower errors and higher structural similarity (SSIM) than the state of the art.

2 Methods

We describe our novel generative model for joint PET-MRI, relying on a sparse joint-dictionary model, and our Bayesian PET image reconstruction using EM.

2.1 Generative Model for PET-MRI Using a Joint Sparse Dictionary

We propose a joint MRF-based sparse dictionary model for the pair of MRI magnitude and PET activity images. Let X model the MRI image and Y model the PET activity image, in a common spatial coordinate frame. Let (X, Y) be an MRF with a neighborhood system \(\mathcal {N} := \{ \mathcal {N}_i \}_{i=1}^I\), where \(\mathcal {N}_i\) is the set of neighbors of voxel i. For any voxel i in the MRI or PET images, its neighbors include all other voxels in the MRI image within a distance \(d_X\) and all other voxels in the PET image within a distance \(d_Y\). This paper sets \(d_X := 4\) mm and \(d_Y := 8\) mm. This neighborhood system allows us to model the MRF’s Gibbs energy in terms of square-shaped cliques (patches) \(\mathcal {P}_i\) of width \(d_X + 1 = 5\) mm in the MRI image and patches \(\mathcal {Q}_i\) of width \(d_Y + 1 = 9\) mm in the PET image. We model the PET-MRI joint patch \((X_{\mathcal {P}_i},Y_{\mathcal {Q}_i})\), at each voxel i, as a sparse linear combination of template patches in the joint dictionary A with J atoms, where the j-th atom comprises an MRI template patch \(A^X_j\) paired with a PET template patch \(A^Y_j\). Because the PET and MRI images are non-negative, we model the components of atoms \(A^X_j, A^Y_j\) as non-negative and we enforce non-negativity on the coefficients in the dictionary fit. The prior \(P (X,Y | A) := \eta \exp (- G (X,Y,A) )\), where the Gibbs energy \(G (X,Y,A) := \sum _{i=1}^I \min _{c_i \succeq 0} \big ( \beta \Vert X_{\mathcal {P}_i} - A^X c_i \Vert _2^2 + (1 - \beta ) \Vert Y_{\mathcal {Q}_i} - A^Y c_i \Vert _2^2 + \lambda \Vert c_i \Vert _1 \big )\), \(c_i\) is the common coefficient vector used for fitting patches in both MRI and PET images at voxel i, \(c_i \succeq 0\) constrains each coefficient within \(c_i\) to be non-negative, and \(\eta \) is the normalizing constant. The free parameter \(\beta \in [0,1]\) balances the quality of fit in the MRI and PET images, adapted to their relative noise levels. The free parameter \(\lambda \in \mathbb {R}^+\) controls the sparsity of the dictionary coefficients. We tune the free parameters \(\beta , \lambda \) using cross validation.
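For concreteness, the following is a minimal sketch of the inner minimization over \(c_i\) for one joint patch pair, using proximal projected gradient descent (Python/NumPy; the solver choice, step size, and parameter values are our illustrative assumptions, not the paper's tuned settings).

```python
import numpy as np

def joint_sparse_code(p_x, p_y, A_x, A_y, beta=0.5, lam=0.1, n_iter=200):
    """Fit one joint PET-MRI patch pair with non-negative sparse coefficients.

    Minimizes beta*||p_x - A_x c||^2 + (1-beta)*||p_y - A_y c||^2 + lam*||c||_1
    over c >= 0, via proximal projected gradient descent. A sketch of the
    per-patch term inside the Gibbs energy G, not the paper's exact solver.
    """
    c = np.zeros(A_x.shape[1])
    # Lipschitz constant of the smooth part gives a safe constant step size.
    L = 2.0 * (beta * np.linalg.norm(A_x, 2) ** 2
               + (1 - beta) * np.linalg.norm(A_y, 2) ** 2)
    step = 1.0 / max(L, 1e-12)
    for _ in range(n_iter):
        grad = (2 * beta * A_x.T @ (A_x @ c - p_x)
                + 2 * (1 - beta) * A_y.T @ (A_y @ c - p_y))
        # Proximal step for lam*||c||_1 under c >= 0: shifted soft threshold.
        c = np.maximum(c - step * (grad + lam), 0.0)
    return c
```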

Learning a Joint PET-MRI Dictionary. We learn the joint dictionary A, comprising atom pairs \(\{ (A^X_j, A^Y_j) \}_{j=1}^J\), from a training set of T high-quality PET-MRI images \(\{ (\mathring{X}^t, \mathring{Y}^t) \}_{t=1}^T\), as the maximum-likelihood estimate \(\arg \max _A \prod _{t=1}^T P (\mathring{X}^t, \mathring{Y}^t | A)\) subject to scale constraints on the atoms \(\Vert A^X_j \Vert ^2_2 + \Vert A^Y_j \Vert ^2_2 \le 1\), \(\forall j\), positivity constraints on the atoms \(A^X_j \succeq 0, A^Y_j \succeq 0, \forall j\), and positivity constraints on the coefficients \(c_i \succeq 0\). Our design of the dictionary learning (and coding) formulation leads to smooth convex problems for optimizing (i) A given c and (ii) c given A; both involve quadratic objective functions and convex constraint sets. We solve by iteratively alternating the optimization of A and c, where each update uses projected gradient descent with an adaptive step size to ensure improvement in the objective-function value. We initialize the atoms by (i) running k-means++ on the subset of joint patches \((\mathring{X}_{\mathcal {P}_i}, \mathring{Y}_{\mathcal {Q}_i})\) whose variance is significantly higher than the noise variance, excluding constant patches, and (ii) adding two constant atoms, one each for MRI and PET, to model patches that are constant or at the minimum positive intensity.
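The atom update given fixed coefficients could look like the following sketch, which applies projected gradient descent under the non-negativity and joint-norm constraints above (the fixed step size and matrix shapes are our assumptions; the paper's implementation adapts the step size to guarantee improvement).

```python
import numpy as np

def update_dictionary(A_x, A_y, P_x, P_y, C, beta=0.5, step=1e-3, n_iter=50):
    """One block of the alternating scheme: refine atoms with coefficients fixed.

    P_x, P_y hold training patches as columns; C holds their coefficients
    (one column per patch). Projected gradient descent keeps atoms non-negative
    and each joint atom inside the unit ball ||A_x_j||^2 + ||A_y_j||^2 <= 1.
    """
    for _ in range(n_iter):
        R_x = A_x @ C - P_x          # residuals in the MRI channel
        R_y = A_y @ C - P_y          # residuals in the PET channel
        A_x = A_x - step * 2 * beta * (R_x @ C.T)
        A_y = A_y - step * 2 * (1 - beta) * (R_y @ C.T)
        A_x = np.maximum(A_x, 0.0)   # project onto the non-negative orthant
        A_y = np.maximum(A_y, 0.0)
        # Project each joint atom onto the unit ball (columns are atoms).
        norms = np.sqrt(np.sum(A_x**2, axis=0) + np.sum(A_y**2, axis=0))
        scale = np.maximum(norms, 1.0)
        A_x, A_y = A_x / scale, A_y / scale
    return A_x, A_y
```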

2.2 MAP Reconstruction of PET with Joint-Dictionary Prior, EM

Likelihood Model for PET. Given \(Y_i\) as the rate of gamma-ray emission counts at each voxel i, let \(\alpha _{i,d}\) be the fraction of emissions reaching detector d in a ring of D detectors. We model \(\alpha _{i,d}\) as voxel strip integrals, as in STIR (github.com/UCL/STIR). At detector d, let the emission rate coming from voxel i be \(Y_{i,d} := Y_i \alpha _{i,d}\), so that the total emission rate is \(\sum _{i=1}^I Y_{i,d}\). Let the observed emission at detector d be \(Z_d \sim \text {Poisson} (\sum _{i=1}^I Y_{i,d})\). Equivalently, let the part of the observed emission at detector d coming from voxel i be \(Z_{i,d} \sim \text {Poisson} (Y_{i,d})\), so that \(Z_d = \sum _{i=1}^I Z_{i,d}\). Then, the likelihood \(P (Z | Y) := \prod _{d=1}^D P (Z_d | \sum _{i=1}^I Y_{i,d})\), and the complete-data likelihood factors as \(\prod _{d=1}^D \prod _{i=1}^I P (Z_{i,d} | Y_{i,d})\).
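A toy simulation of this measurement model, with a random stand-in for the strip-integral system matrix that STIR would actually provide, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

I, D = 64 * 64, 180                  # illustrative voxel and detector counts
# Toy system matrix: alpha[d, i] is the emission fraction from voxel i
# reaching detector d; columns normalized so fractions sum to one per voxel.
alpha = rng.uniform(size=(D, I))
alpha /= alpha.sum(axis=0)
y = rng.uniform(0.0, 10.0, size=I)   # toy activity image, flattened

mean_counts = alpha @ y              # expected counts: sum_i y_i * alpha_{i,d}
z = rng.poisson(mean_counts)         # one Poisson realization of the sinogram
```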

Given the observed PET data \(\{ z_d \}_{d=1}^D\), the observed MRI image x, and the joint dictionary A, our reconstructed PET activity image Y is the MAP estimate: \(\arg \max _Y P (Y | z, x, A) = \arg \max _Y P (z | Y)\, P (Y, x | A)\).

EM for PET Image Reconstruction. We propose to solve the MAP estimation problem for the parameter Y, given observations \(\{ z_d \}_{d=1}^D\), using EM. We model the \(Z_{i,d}\) as the hidden variables. We initialize the iterative EM optimization with the PET image estimate from filtered back-projection followed by standard EM [15] without any priors. At iteration m, let the current estimate of the PET image be \(y^m\). The E step constructs the function \(Q (Y; y^m) := E_{P (\{ Z_{i,d} \}_{i = 1, d = 1}^{I,D} | y^m, \{ z_d \}_{d = 1}^D; X, A)} [ \log P (\{ Z_{i,d} \}_{i = 1, d = 1}^{I,D} | Y) + \log P (Y,x | A) ]\), where the expectand is the sum of the complete-data log likelihood and the log prior. The M step maximizes \(Q (Y; y^m)\) to produce the updated reconstruction estimate \(y^{m+1}\), using the update rule based on the one-step-late modified-EM strategy [7] adapted to our PET-MRI joint-dictionary modeling framework. Thus, the M step updates the reconstruction estimate, for each voxel i, as

$$\begin{aligned} y^{m+1}_i := \left( \sum _{d=1}^D \frac{ y^m_i \alpha _{i,d} z_d }{ \sum _{i=1}^I y^m_i \alpha _{i,d} } \right) / \left( { \frac{ \partial G (y,x,A) }{ \partial y_i } \bigg |_{y_i = y^m_i} + \sum _{d=1}^D \alpha _{i,d} } \right) , \forall i , \end{aligned}$$
(1)

where we evaluate \(\partial G (y,x,A) / \partial y_i\) at \(y_i = y^m_i\) by (i) fitting the dictionary A to the image pair \((x,y^m)\), as dictated within \(G (y,x,A)\), to produce optimal coefficients \(\{ c^*_i \}_{i=1}^I\) (our formulation makes this a quadratic programming problem with efficient solvers leading to global optima) and then (ii) taking the partial derivative of \((1-\beta ) \sum _{i=1}^I \Vert Y_{\mathcal {Q}_i} - A^Y c^*_i \Vert _2^2\) with respect to \(y_i\). The EM iterations stop when the relative change between the estimates \(y^m\) and \(y^{m+1}\) is small.
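A minimal sketch of the update in Eq. (1), assuming the prior gradient \(\partial G / \partial y_i\) has already been computed by steps (i)-(ii) above (the numerical guards are our additions, not part of the stated update rule):

```python
import numpy as np

def osl_em_update(y, z, alpha, grad_G):
    """One one-step-late EM update (Eq. 1).

    y: current activity estimate y^m (length I); z: observed counts (length D);
    alpha: D-by-I system matrix; grad_G: dG/dy_i evaluated at y^m (length I),
    obtained by sparse-coding (x, y^m) with the dictionary and differentiating
    the PET fidelity term of G. A sketch: real code must also guard against
    non-positive denominators caused by large prior gradients.
    """
    proj = alpha @ y                                  # sum_i y_i alpha_{i,d}
    back = alpha.T @ (z / np.maximum(proj, 1e-12))    # backprojected data ratio
    numer = y * back                                  # EM numerator of Eq. (1)
    denom = grad_G + alpha.sum(axis=0)                # OSL denominator of Eq. (1)
    return numer / np.maximum(denom, 1e-12)
```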

3 Evaluation, Results, and Discussion

We compare our method with 5 other reconstruction methods: (i) EM without priors (MLEM) [15], (ii) EM with an edge-preserving Huber-loss based MRF prior (HuberMRF), (iii) the joint total variation (JTV) prior [6], (iv) the parallel level sets (PLS) prior [5, 6], and (v) the MRI-patch dictionary prior for PET reconstruction (MRI-Dict) [18]. For all methods relying on the information in the MRI image, including JTV and PLS, we fix the acquired MRI image to the ground truth. For all methods we tune the underlying free parameters to give the best results qualitatively and quantitatively. We evaluate on 3 datasets: (i) the simulated phantom used in [5, 6], (ii) the simulated BrainWeb phantom used in [5, 6], and (iii) in vivo PET-MRI. For quantitative evaluation we use SSIM [21] and the relative root mean squared error (RRMSE), i.e., the ratio of the RMSE between \(y^{\mathrm {true}}\) and \(y^{\mathrm {estimate}}\) to the RMS value of \(y^{\mathrm {true}}\).
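These two metrics can be computed, e.g., with NumPy and scikit-image (assuming scikit-image's structural_similarity as the SSIM implementation; the paper's exact SSIM settings are not specified here):

```python
import numpy as np
from skimage.metrics import structural_similarity

def rrmse(y_true, y_est):
    """Relative RMSE: RMSE between truth and estimate over the RMS of truth."""
    return np.sqrt(np.mean((y_true - y_est) ** 2)) / np.sqrt(np.mean(y_true ** 2))

def ssim(y_true, y_est):
    """SSIM [21], with the data range taken from the ground-truth image."""
    return structural_similarity(y_true, y_est,
                                 data_range=y_true.max() - y_true.min())
```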

Fig. 1. Validation: Simulated Phantom. (a1)–(a2) PET-MRI ground truth. (b), (c), (d1)–(g1) PET reconstructions using various methods. (d2)–(g2) Residual (reconstructed − truth) images for the results in (d1)–(g1). RRMSE: Ours 0.06, MRI-Dict 0.10, PLS 0.09, JTV 0.08, HuberMRF 0.09, MLEM 0.18. SSIM: Ours 0.92, MRI-Dict 0.81, PLS 0.86, JTV 0.90, HuberMRF 0.84, MLEM 0.61.

Validation: Simulated Phantom. For the simulated phantom (Fig. 1(a1)–(a2)), we blur the PET image sufficiently to reproduce the lower resolution of PET [1] relative to MRI. The EM reconstruction without any prior (Fig. 1(b)) retains much of the noise compared to the prior-based methods. EM with the HuberMRF prior (Fig. 1(c)) removes most of the random noise. JTV and PLS (Fig. 1(g1), (f1)) leverage the anatomical structure in the MRI, encouraging edges in the PET reconstruction to occur at the same spatial locations as the edges in the MRI image. They improve over HuberMRF, but their gradient-based penalties limit the quality of reconstruction of the (i) blue circular blobs, (ii) red circular outer rim, and (iii) red parallel bars in the center. Using an MRI-patch statistical model to reconstruct PET images (PET patch intensities being significantly smoother than MRI patch intensities) results in overfitting of the dictionary to the noise (Fig. 1(e1)). Our reconstruction (Fig. 1(d1)), using a joint patch-based dictionary model, maintains both fine-scale regularity, in the form of smoothness, and larger-scale regularity, by preserving structures like the straightness and separation of the red bars, the circularity of the blue blobs, and the continuity of the red outer ring. Our reconstruction has much smaller residual magnitudes (Fig. 1(d2)–(g2)) than all other methods and is closest to the ground truth both qualitatively and quantitatively (Fig. 1).

Fig. 2. Validation: BrainWeb-Based Phantom. (a1)–(a2) PET and MRI ground truth. (b), (c), (d1)–(g1) PET image reconstructions using various methods. (d2)–(g2) Residual (truth − reconstructed) images for the results shown in (d1)–(g1).

Validation: BrainWeb-Based Phantom. We simulated PET-MRI from the BrainWeb MRI and its segmentation, akin to the scheme in [5], but our PET image is much smoother than the MRI image, as observed in in vivo imaging. We learn our joint dictionary A on patches from five slices and reconstruct 50 slices. Our reconstructions (i) better preserve fine-scale and larger-scale structure (Fig. 2) and (ii) have better RRMSE and SSIM (Fig. 4(c)) than all other methods.

Fig. 3. Results on in vivo Brain PET-MRI. (a1)–(a2) PET and MRI ground truth. (b), (c), (d1)–(g1) PET image reconstructions using various methods. (d2)–(g2) Residual (truth − reconstructed) images for the results shown in (d1)–(g1).

Fig. 4. Results: Quantitative Evaluation. (a) Dictionary of MRI patches used by MRI-Dict [18]. (b) Our joint dictionary of PET-MRI patches (each MRI patch appears in the red channel; the corresponding PET patch in the green channel). (c)–(d) Box plots of RRMSE and SSIM for 50 slices from the BrainWeb simulation and 4 subjects in vivo.

Results: In vivo Brain PET-MRI. We collected data for 5 subjects on a Siemens 3 T PET-MRI scanner (PET slice thickness 2 mm; T1 MRI with 1 mm\(^3\) voxels). We learn the joint patch dictionary from one subject and reconstruct 50 slices from the other subjects. Unlike the dictionary of MRI patches in MRI-Dict (Fig. 4(a)), our joint dictionary (Fig. 4(b)) captures both the regularity in individual patches (which is distinctive for PET and MRI) and the PET-MRI correlations in spatially-corresponding patches. Our reconstructions are qualitatively (Fig. 3) and quantitatively (Fig. 4(d)) better than all other methods.

Conclusion. We propose a novel joint generative model for the PET-MRI image pair, relying on a joint patch-based dictionary model formulated as an MRF prior for Bayesian PET reconstruction using EM. While our model learns the fine-scale and larger-scale regularity in patches of PET and MRI images individually, it also learns structure-function correlations between spatially-corresponding patches in MRI and PET images. Our reconstructions on simulated and in vivo PET-MRI improve qualitatively and quantitatively over the state of the art.