1 Introduction

In neuroimaging studies, or more generally in shape analysis, normalising a set of subjects consists of deforming them towards a common space that allows one-to-one correspondence. Finding this common space usually reduces to finding an optimal shape in terms of distance, in the space of deformations, to all subjects. However, the covariance structure of these deformations is not known a priori, and the deformation metric generally involves penalising roughness. Yet, in a Bayesian setting, a prior that is informative of population variance would make the registration process, which relies on a posteriori estimates, more robust. Shape models aim to learn this covariance structure from the data. As deformations are parameterised in a very high-dimensional space, learning their covariance is computationally intractable. Dimensionality reduction techniques are therefore commonly used, although some authors have tackled this problem by parameterising deformations with location-adaptive control points [8].

For a long time, due to their computational complexity, shape modelling approaches had only been applied to simple models of deformations [7] or simple data sets [6, 9]. Recently, Zhang and Fletcher applied a probabilistic shape model, named principal geodesic analysis, to densely sampled diffeomorphisms and 3D MR images of the brain [14]. Diffeomorphisms correspond to a particular family of deformations that are ensured to be invertible, allowing for very large deformations. Geodesic shooting of diffeomorphisms involves specifying a Riemannian metric on their tangent space and allows a diffeomorphism to be entirely parameterised by its initial velocity [3, 11]. However, because Zhang and Fletcher’s optimisation scheme relies either on gradient descent or on Monte Carlo sampling of the posterior, they have focused on effectively reducing the dimensionality of velocity fields. This results in an effective approach for studying the principal modes of variation, but may give less accurate alignment than classical approaches. In particular, they do not explicitly model “anatomical noise”, i.e., deformations that are not captured by the principal modes.

Here, we propose a generative shape model, whose posterior is inferred using variational inference and Laplace approximations. A residual velocity field capturing anatomical noise is explicitly defined and its magnitude is inferred from the data. An efficient Gauss-Newton optimisation is used to obtain the maximum a posteriori latent subspace as well as individual coordinates, minimising the chances of falling into local minima, and making the registration more robust.

2 Methods

2.1 Generative Shape Model

First, let us define a generative model of brain shape. Here, the observed variables are assumed to be categorical images (i.e., segmentations) comprising K classes—e.g. grey matter, white matter, background—drawn from a categorical distribution. This kind of data term has proved very effective for driving registration [2] and is compatible with unified models of registration and segmentation. If needed, it is straightforward to replace this term with a stationary Gaussian noise model. The template \(\varvec{\mu }\) encodes the prior probability of finding each of the K classes at a given location, and is deformed towards the n-th subject according to the spatial transform \(\varvec{\phi }_n\). In practice, its log-representation \(\varvec{a}\) is encoded by trilinear basis functions, and the deformed template is recovered by softmax interpolation [2]:

$$\begin{aligned} \textstyle \mu _n^{(k)}(\varvec{x}) = \left[ \mathrm {softmax}\left( \varvec{a}\circ \varvec{\phi }_n\left( \varvec{x}\right) \right) \right] _k = \frac{\exp \left( a^{(k)}\circ \varvec{\phi }_n(\varvec{x})\right) }{\sum _{l=1}^K \exp \left( a^{(l)}\circ \varvec{\phi }_n(\varvec{x})\right) } . \end{aligned}$$
(1)

Note that the discrete operation \(\varvec{a}\circ \varvec{\phi }_n\) can be equivalently written as the matrix multiplication, \(\varvec{\varPhi }_n\varvec{a}\), where \(\varvec{\varPhi }_n\) is a large and sparse matrix that depends on \(\varvec{\phi }_n\) and performs the combined “sample and interpolate” operation. We will name this operation pulling, while the multiplication by its transpose, \(\varvec{\varPhi }_n^{T}\), will be named pushing.
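To make the pulling and pushing operations concrete, the following is a minimal 1D sketch (illustrative names, not the paper’s implementation) in which \(\varvec{\varPhi }\) is built explicitly as a sparse linear-interpolation matrix; in 3D the same construction uses trilinear weights with eight neighbours per voxel.

```python
import numpy as np
from scipy.sparse import coo_matrix

def interpolation_matrix(phi, n):
    """Sparse Phi such that (Phi @ a)[i] linearly interpolates a at location phi[i]."""
    lo = np.clip(np.floor(phi).astype(int), 0, n - 2)   # lower neighbour of each sample
    w = np.clip(phi - lo, 0.0, 1.0)                     # weight of the upper neighbour
    rows = np.repeat(np.arange(len(phi)), 2)
    cols = np.stack([lo, lo + 1], axis=1).ravel()
    vals = np.stack([1.0 - w, w], axis=1).ravel()
    return coo_matrix((vals, (rows, cols)), shape=(len(phi), n)).tocsr()

n = 8
a = np.random.randn(n)               # toy 1D "log-template"
phi = np.linspace(0.3, 6.7, n)       # toy deformation: sampling locations in template space
Phi = interpolation_matrix(phi, n)

pulled = Phi @ a                     # "pulling": sample-and-interpolate the template
f = np.random.rand(n)                # toy subject image
pushed = Phi.T @ f                   # "pushing": adjoint operation, splats data back to template space
```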

Let \(\left\{ \varvec{f}_n \in \mathbb {R}^{I \times K} ;~ 1 \leqslant n \leqslant N\right\} \) be the set of observed images. For each subject, let \(\varvec{\phi }_n \in \mathbb {R}^{I \times 3}\) be the diffeomorphic transformation, and let \(\varvec{\mu }_n = \mathrm {softmax}\left( \varvec{a}\circ \varvec{\phi }_n\right) \) be the deformed template. The likelihood of observed voxel values at locations \(\left\{ \varvec{x}_i ~;~ 1 \leqslant i \leqslant I\right\} \) is:

$$\begin{aligned} \textstyle p(\varvec{f}_n(\varvec{x}_i) \mid \varvec{\mu }_n(\varvec{x}_i)) = \mathrm {Cat}(\varvec{f}_n(\varvec{x}_i) \mid \varvec{\mu }_n(\varvec{x}_i)) = \prod _{k=1}^K \mu _n^{(k)}(\varvec{x}_i)^{f_n^{(k)}(\varvec{x}_i)}. \end{aligned}$$
(2)
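The following 2D sketch illustrates Eqs. (1) and (2); the paper works in 3D with trilinear interpolation, and all names here are illustrative (map_coordinates with order=1 plays the role of the pulling operation \(\varvec{\varPhi }\)).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def softmax(a, axis=0):
    a = a - a.max(axis=axis, keepdims=True)     # numerical stabilisation
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def deformed_template(a, phi):
    """Eq. (1): warp the log-template a (K, nx, ny) by phi (2, nx, ny), then softmax over classes."""
    a_warp = np.stack([map_coordinates(a[k], phi, order=1, mode='nearest')
                       for k in range(a.shape[0])])
    return softmax(a_warp, axis=0)

def log_likelihood(f, mu, eps=1e-12):
    """Eq. (2): sum over voxels and classes of f_k(x_i) * log mu_k(x_i)."""
    return float(np.sum(f * np.log(mu + eps)))

# Toy usage: an identity deformation leaves the template untouched
K, nx, ny = 3, 16, 16
a = np.random.randn(K, nx, ny)
grid = np.stack(np.meshgrid(np.arange(nx), np.arange(ny), indexing='ij')).astype(float)
mu = deformed_template(a, grid)
f = np.eye(K)[np.random.randint(K, size=(nx, ny))].transpose(2, 0, 1)   # one-hot segmentation
print(log_likelihood(f, mu))
```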

In this work, diffeomorphisms are defined as geodesics according to a Riemannian metric defined by a positive definite operator named L, and are thus entirely parameterised by their initial velocity [11]. A complete transformation \(\varvec{\phi }\) is recovered by integrating the velocity in time, knowing that the momentum, \(\varvec{u}_t = L \varvec{v}_t\), is conserved at all times t:

$$\begin{aligned} \textstyle \varvec{u}_t(\varvec{x}) = \left| \varvec{D}\varvec{\phi }_t^{-1}(\varvec{x})\right| \left( \varvec{D}\varvec{\phi }_t^{-1}(\varvec{x})\right) ^{T} \left( \varvec{u}_0\circ \varvec{\phi }_t^{-1}(\varvec{x})\right) . \end{aligned}$$
(3)
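As a purely schematic illustration of Eq. (3) (this is not the paper’s geodesic shooting integrator), the 2D sketch below evaluates the transported momentum from \(\varvec{u}_0\) and a sampled inverse deformation, approximating the Jacobian by central finite differences:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def transported_momentum(u0, phi_inv):
    """Eq. (3): u_t(x) = |D phi_t^{-1}(x)| (D phi_t^{-1}(x))^T u_0(phi_t^{-1}(x)).

    u0      : (2, nx, ny) initial momentum field
    phi_inv : (2, nx, ny) sampled inverse deformation, in voxel coordinates
    """
    # Jacobian J[i, j] = d(phi_inv_i)/d(x_j), by central finite differences
    J = np.stack([np.stack(np.gradient(phi_inv[i]), axis=0) for i in range(2)])
    det = J[0, 0] * J[1, 1] - J[0, 1] * J[1, 0]          # |D phi^{-1}|
    # u_0 composed with phi^{-1} (bilinear interpolation of each component)
    u0_warp = np.stack([map_coordinates(u0[i], phi_inv, order=1, mode='nearest')
                        for i in range(2)])
    # pointwise (D phi^{-1})^T u_0(phi^{-1}(x)), scaled by the determinant
    return det * np.einsum('ji...,j...->i...', J, u0_warp)

# Sanity check: with phi^{-1} equal to the identity, the momentum is unchanged
nx, ny = 16, 16
grid = np.stack(np.meshgrid(np.arange(nx), np.arange(ny), indexing='ij')).astype(float)
u0 = np.random.randn(2, nx, ny)
assert np.allclose(transported_momentum(u0, grid), u0)
```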

The velocity field can be retrieved from the momentum field by performing the inverse operation \(\varvec{v}_t = K \varvec{u}_t\), where K is L’s Green’s function. Because we want to model inter-individual variability, all initial velocities need to be defined in the same (template) space, which is achieved by using the initial velocity of the inverse transformations, which we denote \(\left\{ \varvec{v}_n \in \mathbb {R}^{I \times 3} ~;~ 1 \leqslant n \leqslant N\right\} \). Following [13], we use the probabilistic principal component analysis (PPCA) framework to regularise initial velocity fields, which leads to writing them as a linear combination of principal modes plus a residual field. Let us assume that we want to model them with \(M \ll 3I\) principal components. Then, let \(\varvec{W}\) be a \(3I \times M\) matrix (called the principal subspace), each column being one principal component, let \(\varvec{z}_n\) be the latent representation of a given velocity field in the principal subspace and let \(\varvec{r}_n\) be the corresponding residual field. This yields \(\varvec{v}_n = \varvec{W}\varvec{z}_n + \varvec{r}_n\). In [13], latent coordinates \(\varvec{z}_n\) stem from a standard Gaussian and \(\varvec{r}_n\) is i.i.d. Gaussian noise, and a maximum-likelihood estimate of the principal subspace is retrieved. A Bayesian version can be designed by placing a Gaussian prior on each principal component [5]. Smooth velocities can be enforced with a smooth prior over each principal component and over the residual field, and a Gaussian prior over the latent coordinates, yielding:

$$\begin{aligned} p(\varvec{W})&= \prod _{m=1}^M \mathcal {N}\left( \varvec{w}_m \mid \varvec{0}, \varvec{L}^{-1}\right) , \end{aligned}$$
(4)
$$\begin{aligned} p(\varvec{z}_n \mid \varvec{A})&= \mathcal {N}\left( \varvec{z}_n \mid \varvec{0}, \varvec{A}^{-1}\right) , \end{aligned}$$
(5)
$$\begin{aligned} p(\varvec{r}_n \mid \lambda )&= \mathcal {N}\left( \varvec{r}_n \mid \varvec{0}, \left( \lambda \varvec{L}\right) ^{-1}\right) , \end{aligned}$$
(6)

where \(\varvec{L}\) is the discretisation of L and \(\lambda \) is the anatomical noise precision. However, this approach is often not regularised enough. Zhang and Fletcher [14] proposed a different prior, which can be seen as being set over the reconstructed velocities. In practice, it takes the form of a joint distribution over all latent coordinates, residual fields and the principal subspace:

$$\begin{aligned} \textstyle p(\varvec{z}_{1 \dots N}, \varvec{r}_{1 \dots N}, \varvec{W}) \propto \prod _{n=1}^N \mathcal {N}\left( \varvec{W}\varvec{z}_n + \varvec{r}_n \mid \varvec{0}, \varvec{L}^{-1}\right) . \end{aligned}$$
(7)

The advantage of the first formulation (4–6) is that it makes explicit the covariance matrix of the latent coordinates and the noise precision, which can then be inferred from the data. Additionally, it could be extended to multimodal latent distributions such as Gaussian mixtures. However, the second formulation (7) is better at effectively regularising the principal subspace and, in general, the reconstructed velocities. In practice, we use a weighted combination of these two formulations, whose weights we call \(\gamma _1\) and \(\gamma _2\).
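To make the two formulations and their weighted combination concrete, here is a small sketch under simplifying assumptions: velocities are treated as flat vectors, \(\varvec{L}\) is a toy 1D membrane-like operator (the experiments actually combine membrane, bending and linear-elastic energies), normalising constants are dropped, and all names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def membrane_like_operator(D, eps=1e-3):
    """Toy discretisation of L: eps*I + G^T G, with G a forward-difference operator."""
    G = sp.diags([-np.ones(D - 1), np.ones(D - 1)], [0, 1], shape=(D - 1, D))
    return (eps * sp.eye(D) + G.T @ G).tocsr()

def neg_log_prior(W, z, r, A, lam, L, gamma1, gamma2):
    """Weighted combination of the two regularisation terms, for one subject (constants dropped).

    Formulation (4-6): 0.5 * z^T A z + 0.5 * lam * r^T L r
    Formulation (7)  : 0.5 * (W z + r)^T L (W z + r)
    """
    e1 = 0.5 * z @ (A @ z) + 0.5 * lam * r @ (L @ r)
    v = W @ z + r
    e2 = 0.5 * v @ (L @ v)
    return gamma1 * e1 + gamma2 * e2

D, M = 64, 4
L = membrane_like_operator(D)
W = np.random.randn(D, M) * 0.1     # principal subspace
z = np.random.randn(M)              # latent coordinates
r = np.random.randn(D) * 0.01       # residual field
A = np.eye(M)                       # latent precision
print(neg_log_prior(W, z, r, A, lam=10.0, L=L, gamma1=1.0, gamma2=1.0))
```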

The noise precision, \(\lambda \), can be inferred in a Bayesian fashion by introducing a conjugate Gamma prior with \(\alpha = \frac{\nu _0 \times 3I}{2}\) and \(\beta = \frac{\nu _0 \times 3I}{2\lambda _0}\), as shown in [12]. Similarly, the latent precision matrix \(\varvec{A}\) is given a conjugate Wishart prior, which is made as non-informative as possible by setting its degrees of freedom to M, and whose expected value is the identity matrix, i.e., \(\mathbb {E}\left[ \varvec{A}\right] = \varvec{I}_M\). This prior has the opposite effect of an automatic relevance determination prior, since it prevents principal components from collapsing during the first iterations by promoting non-null variances.

Finally, we look for a maximum a posteriori estimate of the template, \(\varvec{a}\), with a very uninformative log-Dirichlet prior that prevents null probabilities (a smooth prior could also be used following [2]). The complete model is depicted in the form of a graphical model in Fig. 1.

Fig. 1. Generative shape model, in the form of a graphical model. Circles indicate random variables while diamonds indicate deterministic variables. Shaded variables are observed. Plates indicate replication.

2.2 Inference

A basic inference scheme would be to search for a mode of the model likelihood, by optimising in turn each parameter of the model. It is however more consistent to tackle this problem as one of missing data, which is dealt with by computing the posterior distribution over all latent variables. Unfortunately, the posterior does not possess a tractable form. A solution is to use variational inference to describe an approximate posterior q that can be more easily computed, by restricting the search space to distributions that factorise over a subset of variables [4]. This method allows the uncertainty about parameter estimates to be accounted for when inferring other parameters. Here, for computational reasons, we do not perform a fully Bayesian treatment of the problem and look for mode estimates of the principal subspace and template. We still marginalise over all subject-specific parameters (latent coordinates and residual field), as recommended by [1]. We define the set of marginalised latent variables as \(\varvec{\varUpsilon } = \left\{ \varvec{z}_{1 \dots N}, \varvec{r}_{1 \dots N}, \varvec{A}, \lambda \right\} \) and make the (mean field) approximation that the posterior factorises over \((\varvec{z}_{1 \dots N})\), \((\varvec{r}_{1 \dots N})\), \((\varvec{A})\) and \((\lambda )\).

Since we used conjugate priors for the latent precision matrix and the anatomical noise precision, their posteriors have the same form as their priors, and the update equations are equivalent to computing a weighted average between the prior expected value and the maximum-likelihood estimate (illustrated in the sketch below). In contrast, the posterior distributions of \(\varvec{z}_n\) and \(\varvec{r}_n\) have no simple form. We thus make a Laplace approximation and estimate their mean (\(\varvec{z}_n^\star \), \(\varvec{r}_n^\star \)) and covariance (\(\varvec{S}_{\mathrm {z},n}^\star \), \(\varvec{S}_{\mathrm {r},n}^\star \)) from the mode and the second derivatives about this mode. These are obtained by Gauss-Newton optimisation and the corresponding derivations are provided in Sect. 2.3. Because of the non-linearity induced by geodesic shooting and template interpolation, we first make the approximation that:

$$\begin{aligned} \mathbb {E}_q\Big [p\left( \varvec{f}_n \mid \varvec{z}_n, \varvec{r}_n, \varvec{W}, \varvec{a}\right) \Big ] \approx p\left( \varvec{f}_n \mid \varvec{z}_n^\star , \varvec{r}_n^\star , \varvec{W}, \varvec{a}\right) = p\left( \varvec{f}_n \mid \varvec{\mu }_n^\star \right) , \end{aligned}$$
(8)

where \(\mathbb {E}_q\) means the posterior expected value and \(\varvec{\mu }_n^\star \) is the template deformed according to the above parameter estimates. Consequently, we find:

(9)
(10)
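As an illustration of the conjugate update described above, the sketch below shows what a standard Gamma update for \(\lambda \) looks like under the prior parameterisation given earlier; it is a schematic reconstruction, not necessarily the exact form of Eqs. (9–10), and all names are illustrative.

```python
import numpy as np

def update_noise_precision(nu0, lambda0, rLr_expected, dim):
    """Conjugate Gamma update for the anatomical noise precision lambda (illustration).

    Prior: alpha0 = nu0 * dim / 2, beta0 = nu0 * dim / (2 * lambda0), so E[lambda] = lambda0.
    With r_n ~ N(0, (lambda L)^{-1}), a standard conjugate update gives
        alpha = alpha0 + N * dim / 2
        beta  = beta0  + 0.5 * sum_n E[r_n^T L r_n],
    so the posterior mean alpha / beta is a weighted average between the prior mean
    lambda0 and the maximum-likelihood estimate.

    rLr_expected : per-subject values of E[r_n^T L r_n]
                   (= r_n*^T L r_n* + trace(L S_rn) under the Laplace approximation)
    dim          : dimension of one residual field (3I)
    """
    alpha0 = nu0 * dim / 2.0
    beta0 = nu0 * dim / (2.0 * lambda0)
    alpha = alpha0 + len(rLr_expected) * dim / 2.0
    beta = beta0 + 0.5 * float(np.sum(rLr_expected))
    return alpha / beta          # posterior expectation of lambda
```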

2.3 Gauss-Newton Optimisation

Gauss-Newton (GN) optimisation of an objective function \(\mathcal {E}\) with respect to a vector of parameters \(\varvec{x}\) consists of iteratively improving the objective function by making, locally, a second-order approximation. The gradient, \(\varvec{g}\), and Hessian matrix, \(\varvec{H}\), are computed about the current best estimate of the optimal parameters, \(\varvec{x}_i\), and the new estimate is found according to \(\varvec{x}_{i+1} = \varvec{x}_i - \varvec{H}^{-1} \varvec{g}\). In practice, this update scheme sometimes overshoots; it is therefore common to perform a backtracking line search along the direction \(-\varvec{H}^{-1} \varvec{g}\).
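A generic sketch of this update scheme (illustrative, not the paper’s implementation) is given below, together with a toy least-squares problem on which Gauss-Newton converges in a single step.

```python
import numpy as np

def gauss_newton(x0, objective, grad_hess, max_iter=50, tol=1e-9):
    """Gauss-Newton with a backtracking line search along -H^{-1} g."""
    x, fx = x0.copy(), objective(x0)
    for _ in range(max_iter):
        g, H = grad_hess(x)
        step = np.linalg.solve(H, g)          # H^{-1} g
        t = 1.0                               # backtracking: halve until the objective improves
        while objective(x - t * step) >= fx and t > 1e-8:
            t *= 0.5
        x_new = x - t * step
        f_new = objective(x_new)
        if f_new >= fx:                       # no improving step found
            break
        converged = fx - f_new < tol * (abs(fx) + tol)
        x, fx = x_new, f_new
        if converged:
            break
    return x

# Toy usage on a linear least-squares problem
X = np.random.randn(20, 5)
y = np.random.randn(20)
objective = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad_hess = lambda w: (X.T @ (X @ w - y), X.T @ X)
w_star = gauss_newton(np.zeros(5), objective, grad_hess)
assert np.allclose(X.T @ (X @ w_star - y), 0, atol=1e-6)
```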

Differentiating the Data Term: Let us write \(\mathcal {E}\) the (negative) data term for an arbitrary subject:

$$\begin{aligned} \textstyle \mathcal {E} = -\ln p\left( \varvec{f} \mid \varvec{\mu }\right) = -\sum _{i=1}^{I} \ln \mathrm {Cat}\Big (\varvec{f}(\varvec{x}_i) ~\Big |~ \mathrm {softmax}\left( \varvec{\varPhi }\varvec{a}(\varvec{x}_i)\right) \Big ) = \mathcal {C}_{\varvec{f}}\left( \varvec{\varPhi }\varvec{a}\right) . \end{aligned}$$
(11)

Following [3], differentiating \(\mathcal {E}\) with respect to \(\varvec{v}\) can be approximated by differentiating with respect to \(\varvec{\varPhi }\) and applying the chain rule, which yields:

$$\begin{aligned} \textstyle \frac{\partial \mathcal {E}}{\partial \varvec{v}} = \left( \varvec{\varPhi }^{T} \varvec{\nabla } \mathcal {C}_{\varvec{f}}\left( \varvec{\varPhi }\varvec{a}\right) \right) \varvec{\nabla }\varvec{a}, \end{aligned}$$
(12)

where \(\varvec{\nabla } \mathcal {C}_{\varvec{f}}\) is the gradient of the log-Categorical distribution and takes the form of an \(I \times K\) vector field, and \(\varvec{\nabla }\varvec{a}\) contains spatial gradients of the log-template and takes the form of an \(I \times 3\) vector field. Second derivatives can be approximated by:

$$\begin{aligned} \textstyle \frac{\partial ^2 \mathcal {E}}{\partial \varvec{v}^2} = \varvec{\nabla }\varvec{a}^{T}\left( \varvec{\varPhi }^{T} \varvec{\nabla }^2 \mathcal {C}_{\varvec{f}}\left( \varvec{\varPhi }\varvec{a}\right) \right) \varvec{\nabla }\varvec{a}, \end{aligned}$$
(13)

where \(\varvec{\nabla }^2 \mathcal {C}_{\varvec{f}}\) is the Hessian of the log-Categorical distribution and takes the form of an \(I \times K \times K\) symmetric tensor field. The gradient and Hessian of \(\mathcal {C}_{\varvec{f}}\) were derived in [2] and can be computed in each voxel according to:

$$\begin{aligned} \textstyle \frac{\partial \mathcal {C}_{\varvec{f}}(\varvec{a})}{\partial a_k} = \mu _k \left( \sum _{l=1}^K f_l\right) - f_k , \frac{\partial ^2 \mathcal {C}_{\varvec{f}}(\varvec{a})}{\partial a_k \partial a_m} = \mu _k(\delta ^m_k - \mu _m) \sum _{l=1}^K f_l . \end{aligned}$$
(14)

Since \(\varvec{v} = \varvec{W}\varvec{z} + \varvec{r}\), derivatives with respect to \(\varvec{r}\), \(\varvec{z}\) and \(\varvec{W}\) are obtained by applying the chain rule.
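The sketch below computes the voxel-wise gradient and Hessian of Eq. (14) and notes the chain rule for \(\varvec{z}\) and \(\varvec{r}\), assuming the gradient and Hessian with respect to \(\varvec{v}\) have been assembled via Eqs. (12–13); names are illustrative.

```python
import numpy as np

def categorical_grad_hess(f, mu):
    """Voxel-wise gradient and Hessian of the negative log-categorical term, Eq. (14).

    f, mu : (I, K) observed class responsibilities and deformed template probabilities
    returns g of shape (I, K) and H of shape (I, K, K)
    """
    s = f.sum(axis=1, keepdims=True)                    # sum_l f_l (equals 1 for hard labels)
    g = mu * s - f                                      # mu_k * sum_l f_l - f_k
    K = mu.shape[1]
    H = (np.einsum('ik,km->ikm', mu, np.eye(K))         # mu_k * delta_km
         - np.einsum('ik,im->ikm', mu, mu)) * s[..., None]
    return g, H

# Toy usage
I_, K_ = 1000, 3
f = np.eye(K_)[np.random.randint(K_, size=I_)]          # one-hot segmentation, (I, K)
mu = np.random.dirichlet(np.ones(K_), size=I_)          # deformed template probabilities
g, H = categorical_grad_hess(f, mu)

# Chain rule, since v = W z + r (W: (3I, M), g_v: (3I,), H_v: (3I, 3I) in flattened form):
#   g_z = W.T @ g_v           H_z = W.T @ H_v @ W
#   g_r = g_v                 H_r = H_v
#   and the gradient with respect to W involves the outer product of g_v with z.
```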

Orthogonalisation: The PPCA formulation is invariant to rotations inside the latent space [13]. Consequently, it allows finding an optimal subspace but does not enforce the individual bases \(\varvec{w}_{1 \dots M}\) to be the eigenmodes of the complete covariance matrix. It makes sense, however, to transform the subspace so that it corresponds to the first eigenmodes, as this eases interpretation. Also, in order to enforce a sparse Hessian matrix over \(\varvec{W}\), we require \(\varvec{Z}\varvec{Z}^{T}\) to be diagonal, where the columns of \(\varvec{Z}\) are the individual \(\varvec{z}_n\). This leads us to look for an \(M \times M\) transformation matrix \(\varvec{T}\) that keeps the actual diffeomorphisms untouched (\(\varvec{W}\varvec{z} = \varvec{W}\varvec{T}^{-1}\varvec{T}\varvec{z}\)), while diagonalising both \(\varvec{Z}\varvec{Z}^{T}\) and \(\varvec{W}^{T}\varvec{L}\varvec{W}\). This is done by a series of singular value decompositions that ensure that \(\varvec{T}\varvec{Z}\varvec{Z}^{T}\varvec{T}^{T}\) is diagonal and \(\varvec{T}^{-T}\varvec{W}^{T}\varvec{L}\varvec{W}\varvec{T}^{-1}\) is the identity (a sketch of one such construction is given below). However, the distribution of diagonal weights between \(\varvec{W}^{T}\varvec{L}\varvec{W}\) and \(\varvec{Z}\varvec{Z}^{T}\) is not optimal and we optimise an additional diagonal scaling matrix \(\varvec{Q}\) by alternating between updating \(\varvec{A}\) from the rotated \(\mathbb {E}\left[ \varvec{Z}\varvec{Z}^{T}\right] \), and updating the scaling weights by Gauss-Newton optimisation of the remaining terms of the lower bound that depend on them:

(15)
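One way to construct such a transformation \(\varvec{T}\) (a sketch of a possible construction, not necessarily the paper’s exact sequence of decompositions) combines a Cholesky factorisation of \(\varvec{W}^{T}\varvec{L}\varvec{W}\) with a single SVD:

```python
import numpy as np

def orthogonalise(W, Z, L):
    """Find T so that W z = (W T^{-1})(T z), with T^{-T} W^T L W T^{-1} = I
    and T Z Z^T T^T diagonal.

    W : (D, M) principal subspace, Z : (M, N) latent coordinates, L : (D, D) metric
    """
    C = np.linalg.cholesky(W.T @ (L @ W)).T      # upper triangular, W^T L W = C^T C
    U, s, _ = np.linalg.svd(C @ Z)               # (C Z)(C Z)^T = U diag(s^2) U^T
    T = U.T @ C
    W_new = W @ np.linalg.inv(T)
    Z_new = T @ Z
    return W_new, Z_new, T

# Toy check with an identity metric
D, M, N = 30, 4, 10
L = np.eye(D)
W = np.random.randn(D, M)
Z = np.random.randn(M, N)
W_new, Z_new, T = orthogonalise(W, Z, L)
assert np.allclose(W_new.T @ L @ W_new, np.eye(M), atol=1e-8)
assert np.allclose(Z_new @ Z_new.T, np.diag(np.diag(Z_new @ Z_new.T)), atol=1e-8)
assert np.allclose(W @ Z, W_new @ Z_new, atol=1e-8)
```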
Fig. 2. Template reconstructed using an increasing number of principal modes, and with the addition of the residual field.

Fig. 3. Left: deformed template along the first two principal modes. Right: latent coordinates vs. subject ages. Probable AD subjects are marked with a red cross.

Fig. 4. Left: two random examples of fit from the train and test sets. Right: distribution of categorical log-likelihood values for the train and test sets.

3 Experiments and Results

We ran the algorithm on a training set consisting of the first 38 subjects of the OASIS cross-sectional database [10]. We used the provided FSL segmentations, which we transformed into tissue probability maps by extracting the grey and white matter classes and smoothing them with a 1-voxel FWHM Gaussian kernel. We set the number of principal components to 32; the parameters of the membrane, bending and linear-elastic energies were set to 0.001, 0.02 and (0.0025, 0.005), respectively; and we used \(\gamma _1 = \gamma _2 = 1\). We set the prior parameters of the residual precision magnitude based on tests conducted on 2D axial slices (\(\lambda _0 = 17\), \(\nu _0 = 10\)). Templates reconstructed with a varying number of principal modes, and with or without the residual field, are presented in Fig. 2, while Fig. 3 shows the template deformed along the first two principal modes, the first of which is typical of brain ageing. This pattern is validated by plotting coordinates along the first dimension against actual ages. Finally, the learnt model was tested by registering the template towards 38 unseen images from the OASIS database. The distributions of categorical log-likelihood values for the training and testing sets are depicted in Fig. 4, along with two randomly selected example fits. Both sets have similar distributions (mean ± std, \(\times 10^5\): train \(-7.23 \pm 1.26\); test \(-7.55 \pm 0.88\)), showing the model’s robustness.

4 Conclusion

We presented a generative model of brain shape that does not limit the space of diffeomorphisms, allowing the regularisation to be learned while preserving enough flexibility for accurate normalisation. We showed how the principal modes of variation correlate with known factors of brain shape variability, hinting that this low-dimensional representation might allow discrimination between physiological states. Future research will focus on applying this framework to very large databases, and on combining it with segmentation models in order to work directly with raw data. The latent distribution may be improved by using multimodal priors such as Gaussian mixtures. Our main limitation is the small number of principal bases that can be learned, due to their large size. This could be overcome by explicitly modelling sparse and local covariance patterns.