1 Introduction

Estimating fiber orientation distribution functions (fODFs) is a standard step in analyzing data from diffusion MRI (dMRI). It is usually performed via spherical deconvolution, which solves an ill-conditioned optimization problem with suitable regularizers and constraints [1, 9]. Recently, supervised learning has been explored as an alternative strategy for closely related problems, such as estimating NODDI or diffusional kurtosis parameters [3], microstructure imaging [6], and fiber tractography [7]. Such approaches have been shown to produce robust results, sometimes even from a greatly reduced number of measurements.

This paper aims to harness the power of machine learning for fODF estimation as well. A challenge in doing so is that spherical deconvolution involves a response function as a parameter, which is usually estimated individually for each subject [8, 10]. To ensure that the learned method generalizes to new subjects, we have to design it so that it does not capture specific assumptions about the response function.

We propose a hybrid solution for this problem that combines traditional convex optimization with learning in such a way that the relative strengths of both approaches are preserved. In particular, we build on an fODF autoencoder, which leverages the ability of unsupervised learning to capture distributions that are too complex to be modeled explicitly: In our case, the distribution of fODFs in high-quality real-world data. We use this autoencoder to regularize optimization-based spherical deconvolution, in which the response function can still be freely exchanged, and traditional constraints can be enforced.

2 Background and Related Work

Spherical deconvolution is the standard approach to fODF estimation. It can be formalized as

$$\begin{aligned} \hat{f}={\text {argmin}}_{f} \Vert M f - s \Vert ^2 \end{aligned}$$
(1)

where vector \(s\in \mathbb {R}^k\) contains k diffusion-weighted measurements, solution vector f contains spherical harmonics coefficients describing the fODF, and matrix \(M=ER\) is the product of a diagonal convolution matrix R, derived from the response function, and a matrix E that evaluates the spherical harmonics basis functions in the directions of the measurements. The problem in Eq. (1) is ill-conditioned, and it is typically made more robust by imposing non-negativity constraints [1, 9].
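
To make this concrete, the following is a minimal numpy sketch of the unregularized fit in Eq. (1). It assumes that the basis matrix E and the diagonal r of the convolution matrix R are precomputed (e.g., by a dMRI toolbox); the non-negativity constraints discussed below are omitted:

```python
import numpy as np

def csd_least_squares(s, E, r):
    """Unregularized least-squares fit of Eq. (1).

    s : (k,) diffusion-weighted measurements
    E : (k, n) spherical harmonics basis evaluated at the k
        gradient directions (n = number of SH coefficients)
    r : (n,) diagonal of the convolution matrix R, derived
        from the response function
    """
    M = E * r[None, :]                     # M = E R with R = diag(r)
    # Ill-conditioned in practice; robust solvers additionally
    # impose non-negativity constraints on the fODF [1, 9].
    return np.linalg.lstsq(M, s, rcond=None)[0]
```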

We have been inspired by recent works which demonstrated that, when dealing with other models in diffusion MRI, replacing optimization-based fitting with supervised regression improves robustness and makes it possible to recover plausible estimates from a significantly reduced set of measurements [3, 6]. A straightforward transfer of this idea to Eq. (1) would imply learning a direct mapping from measurements s to model parameters f. We decided against this, since it would effectively hard-wire the convolution matrix R into the mapping. Since R is typically adjusted per-subject [8, 10], this would impair our ability to transfer the resulting method from one subject to another.

Instead, we pursue an alternative strategy that is based on unsupervised learning. It has been inspired by recent works on image restoration, a problem that is mathematically analogous to Eq. (1), with s taking the role of an image degraded by blur and noise, f of the reconstructed image, and M of a matrix that encodes the characteristics of the blur, similar to how M depends on the response function in our case. Our approach follows [2, 5, 13] in using autoencoders to define a regularizer for f, maintaining full flexibility with respect to M. To our knowledge, we are the first to apply this strategy to fODF estimation, and we implement the idea differently from [2, 5, 13].

3 Hybrid fODF Estimation

Our primary goal in regularizing constrained spherical deconvolution is to improve results on suboptimal inputs, such as diffusion MRI data with few gradient directions or low b value. Our proposed method can be written as

$$\begin{aligned} \tilde{f}={\text {argmin}}_{f} H(f)\quad \text {with}\quad H(f)=\Vert M f - s \Vert ^2+\lambda \Vert f-AE(\hat{f};\varTheta )\Vert ^2 \end{aligned}$$
(2)

where H(f) is our novel objective function. We refer to it as a hybrid method, since it combines convex optimization with a learning-based regularizer, rather than replacing optimization with learning as in [3, 6, 7]. M, f, \(\hat{f}\), and s are defined as in Eq. (1), and \(AE(\hat{f};\varTheta )\) denotes application of an autoencoder with parameters \(\varTheta \) to the preliminary estimate \(\hat{f}\) from standard CSD. We will now justify this use of autoencoders and provide details on how they were trained, then explain the computational pipeline of our method and highlight two theoretical advantages of the hybrid approach.

3.1 Role and Training of Autoencoders in Our Method

Autoencoders are neural networks that are trained in an unsupervised fashion to reconstruct their own input. They are most frequently used to learn a latent representation of the training data, e.g., for dimensionality reduction. Denoising autoencoders are a variant of this idea, trained to reconstruct a clean output from a noisy input. Their use in Eq. (2) is motivated by the observation in [2] that the difference between the input and output of a denoising autoencoder can be interpreted as the gradient of a regularizer that represents a Gaussian-smoothed version of the training data distribution. Since we train on fODFs estimated from high-quality data, this means that, when dealing with input that is suboptimal for CSD, our regularizer guides us to prefer fODFs that agree with those observed in more suitable data. Learning the distribution of fODFs, rather than the mapping from the corresponding measurements, increases our ability to generalize to new subjects and measurement protocols.

All our autoencoders were trained with output fODFs that were estimated with standard CSD from 64 uniformly distributed diffusion-weighted images (DWIs) at \(b=2000\), which is clearly above the minimum number of 45 recommended in [10]. We experimented with three different types of autoencoders:

  • A denoising autoencoder (DAE) was trained with input fODFs that have been estimated after perturbing the DWIs with additional Rician noise (SNR = 20)

  • A subsampling autoencoder (SSAE) was trained with input fODFs that have been estimated from a subset of 32 DWIs at \(b=2000\)

  • A quality transfer autoencoder (QTAE) was trained with input fODFs that have been estimated from 30 DWIs at \(b=700\), i.e., a measurement setup commonly used for Diffusion Tensor Imaging, but not for CSD

All autoencoders were trained on 122,324 fODFs from the white matter masks of three subjects; a fourth subject was used as a validation dataset. Hyperparameters were tuned individually for each type of autoencoder. Reported results used network architecture \((64-32-64)\) (i.e., three hidden layers with 64, 32, and 64 neurons, respectively) for the DAE and QTAE, and \((128-128-128-128)\) for the SSAE. We obtained better results with sigmoid than with ReLU activation functions, which we attribute to the limited size and depth of our networks. Optimization was done using Adadelta [12] with initial learning rate 0.1, halved every 500 epochs, batch size 512, L2 regularization with trade-off parameter 0.01, and dropout with probability 0.01.
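
As an illustration of this setup, the following sketch uses PyTorch (our choice here; the framework is not specified above), assumes 15 fODF coefficients per voxel (matching the fourth-order representation in Sect. 3.2) and an MSE reconstruction loss (also an assumption), and otherwise follows the hyperparameters given above; the SSAE would use hidden=(128, 128, 128, 128):

```python
import torch
import torch.nn as nn

N_COEFF = 15  # order-4 SH basis: 15 even coefficients (our assumption)

class FodfAutoencoder(nn.Module):
    """Fully connected autoencoder, e.g. hidden layers (64-32-64)."""
    def __init__(self, hidden=(64, 32, 64)):
        super().__init__()
        dims = [N_COEFF, *hidden, N_COEFF]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:              # hidden layers only
                layers.append(nn.Sigmoid())
                layers.append(nn.Dropout(p=0.01))
        self.net = nn.Sequential(*layers)

    def forward(self, f):
        return self.net(f)

def train(model, degraded_f, clean_f, epochs, batch_size=512):
    # degraded_f, clean_f: (num_voxels, N_COEFF) float tensors of fODF
    # coefficients estimated from degraded and high-quality data.
    opt = torch.optim.Adadelta(model.parameters(), lr=0.1,
                               weight_decay=0.01)  # L2 trade-off 0.01
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=500, gamma=0.5)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(degraded_f, clean_f),
        batch_size=batch_size, shuffle=True)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):   # total epoch count not specified above
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()          # halves the learning rate every 500 epochs
```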

3.2 Computational Pipeline and Theoretical Guarantees

Our computational pipeline is illustrated in Fig. 1. The autoencoder is trained in a pre-process. Given new measurements, we first solve a standard CSD according to Eq. (1) to obtain a preliminary estimate \(\hat{f}\). After passing \(\hat{f}\) through the autoencoder, we integrate the regularizer from Eq. (2) into the least squares problem by augmenting the measurement vector s with \(\sqrt{\lambda }\,AE(\hat{f};\varTheta )\), and matrix M with \(\sqrt{\lambda }\,I\), where the dimension of the identity matrix I matches vector f.
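
Concretely, this augmentation turns Eq. (2) into an ordinary least-squares problem; a numpy sketch (omitting the H-psd constraint, which the cone program described below handles):

```python
import numpy as np

def regularized_csd(M, s, f_ae, lam):
    """Rewrite Eq. (2) as a single least-squares problem by stacking
    the regularizer under the data term:
        ||M f - s||^2 + lam * ||f - f_ae||^2
      = ||[M; sqrt(lam) I] f - [s; sqrt(lam) f_ae]||^2
    """
    n = M.shape[1]
    M_aug = np.vstack([M, np.sqrt(lam) * np.eye(n)])
    s_aug = np.concatenate([s, np.sqrt(lam) * f_ae])
    # Our actual implementation solves this augmented problem as a
    # quadratic cone program with the H-psd constraint [1, 11]; the
    # unconstrained version is shown here for clarity.
    return np.linalg.lstsq(M_aug, s_aug, rcond=None)[0]
```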

Fig. 1. A flow diagram of our method, in which standard constrained spherical deconvolution (CSD) and the output of an fODF autoencoder (AE) feed into a regularized deconvolution that produces the final result \(\tilde{f}\).

Unlike pure learning-based methods that directly map measurements to model parameters [3, 6, 7], our hybrid method allows us to formally guarantee a minimum degree of data fidelity, by maintaining the upper bound

$$\begin{aligned} \Vert M\tilde{f}-s\Vert ^2\le \Vert M\hat{f}-s\Vert ^2+k\sigma ^2 \end{aligned}$$
(3)

on the fitting error of the final result \(\tilde{f}\). Here, k is the number of diffusion-weighted measurements, and the user’s choice of \(\sigma \) determines by how much we are willing to increase average residual magnitudes per measurement with respect to the unregularized optimum \(\hat{f}\). In our experiments, we set \(\sigma \) to \(5\%\) of the average \(b=0\) signal intensity within a white matter mask.

The constraint in Eq. (3) is maintained by setting \(\lambda =k\sigma ^2/\Vert \hat{f}-AE(\hat{f};\varTheta )\Vert ^2\). To see why, assume that a solution \(\tilde{f}^\prime \) violates it. Since \(\Vert \tilde{f}^\prime -AE(\hat{f};\varTheta )\Vert ^2\) is non-negative, this implies that the value of the objective function from Eq. (2) must be \(H(\tilde{f}^\prime )>\Vert M\hat{f}-s\Vert ^2+k\sigma ^2\). However, according to our choice of \(\lambda \), \(H(\hat{f})=\Vert M\hat{f}-s\Vert ^2+k\sigma ^2\), so \(H(\tilde{f}^\prime )>H(\hat{f})\), and \(\tilde{f}^\prime \) cannot be optimal.
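
This choice of \(\lambda \) and the resulting bound can be verified with a small self-contained check on random stand-in data (illustration only, no claim about real fODFs):

```python
import numpy as np

def choose_lambda(f_hat, f_ae, k, sigma):
    """lambda = k * sigma^2 / ||f_hat - AE(f_hat)||^2 (Sect. 3.2)."""
    return k * sigma**2 / np.sum((f_hat - f_ae)**2)

rng = np.random.default_rng(0)
k, n = 64, 15
M = rng.normal(size=(k, n))
s = rng.normal(size=k)
f_hat = np.linalg.lstsq(M, s, rcond=None)[0]   # unregularized optimum
f_ae = f_hat + 0.1 * rng.normal(size=n)        # stand-in for AE output
sigma = 0.05
lam = choose_lambda(f_hat, f_ae, k, sigma)
f_tilde = np.linalg.lstsq(
    np.vstack([M, np.sqrt(lam) * np.eye(n)]),
    np.concatenate([s, np.sqrt(lam) * f_ae]), rcond=None)[0]
# Eq. (3) holds by the argument above (small numerical tolerance):
assert (np.sum((M @ f_tilde - s)**2)
        <= np.sum((M @ f_hat - s)**2) + k * sigma**2 + 1e-9)
```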

Our method is implemented within the framework of a recently proposed variant of CSD, which represents fODFs as fourth-order tensors with a positive definiteness constraint (H-psd) [1]. The problems in Eqs. (1) and (2) are expressed as quadratic cone programs, and solved using the publicly available package cvxopt [11], as described in more detail in [1]. Note that [1] does not consider regularization based on unsupervised learning, which is our main contribution.

A second theoretical guarantee that is afforded by our hybrid method, but would be non-trivial to integrate into a pure learning-based approach, is preservation of the H-psd constraint. This implies that, while setting \(\sigma =0\) reduces our method to [1], very high values of \(\sigma \) do not simply lead to \(\tilde{f}\approx AE(\hat{f};\varTheta )\), but rather project \(AE(\hat{f};\varTheta )\) onto the set of non-negative fODFs when needed.

4 Results and Discussion

4.1 Quantitative Results

We quantify the benefit of our method on an independent subject, unrelated to the training or validation data. Within a single session, 64 DWIs at \(b=2000\) and 30 DWIs at \(b=700\) were acquired on a 3T Skyra (Siemens) and corrected for head motion using FSL [4]. From these data, we created four test sets:

  1. CSD protocol (64 DWIs at \(b=2000\))

  2. A subsample of 32 DWIs from the CSD protocol

  3. A subsample of 15 DWIs from the CSD protocol

  4. A DTI protocol (30 DWIs at \(b=700\))

Table 1. Our regularization resulted in the lowest average L2 and angular errors on all test sets, with the most distinct benefits on the suboptimal inputs (2–4).

For each set, Table 1 lists results achieved with standard CSD, as well as with CSD regularized by our three different types of autoencoders. Errors are specified with respect to fODFs that have been estimated with standard CSD from an independent measurement of the same subject (64 DWIs at \(b=2000\), on a 3T Prisma) after spatial registration with the Skyra scan and corresponding rotation of gradient vectors. Error measures are the L2 norm of the difference between estimated and reference fODF coefficients, as well as angular differences between estimated and reference fiber directions (up to the three strongest), weighted by their corresponding volume fractions. All errors are averaged over the 35,470 voxels in the test subject's white matter mask.
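
For reproducibility, a sketch of these error measures; the exact peak-matching scheme is not spelled out above, so this version greedily matches each reference peak to the nearest estimated peak (our assumption):

```python
import numpy as np

def l2_error(f_est, f_ref):
    """L2 norm of the difference between SH coefficient vectors."""
    return np.linalg.norm(f_est - f_ref)

def weighted_angular_error(dirs_est, dirs_ref, fractions_ref):
    """Angular error over up to three reference peaks, weighted by
    volume fractions. dirs_* are (m, 3) arrays of unit vectors;
    antipodal symmetry is handled via the absolute dot product.
    Each reference peak is matched to the nearest estimated peak."""
    errors = 0.0
    for d_ref, w in zip(dirs_ref, fractions_ref):
        cos = np.abs(dirs_est @ d_ref).clip(0.0, 1.0)
        errors += w * np.degrees(np.arccos(cos.max()))
    return errors / np.sum(fractions_ref)
```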

For all suboptimal inputs (sets 2–4), all autoencoders resulted in a clear improvement. The best result in each column is marked in boldface. Notably, our regularization allows us to achieve similar angular errors from half the measurements (set 2) as standard CSD achieves on the full set of measurements (set 1). Moreover, we achieve similar angular errors from \(b=700\) data (set 4) as standard CSD does given a similar number of DWIs at \(b=2000\) (set 2). We believe that it is difficult to beat standard CSD on set 1, since the reference dataset itself has been reconstructed with standard CSD from similar data. Despite this, the SSAE led to a slight improvement even in this case.

Unsurprisingly, best results were obtained when the protocol for generating training data matched the final use case, i.e., the SSAE worked best on set 2, while the QTAE gave best results on set 4. In all experiments, the SSAE and QTAE dominated the DAE. The fact that all autoencoders improved all challenging use cases 2–4, even ones that did not match their training paradigm, indicates that we succeeded in our attempt to construct a method that, in contrast to pure supervised methods such as [3], does not strictly require an expensive re-training for each new use case.

4.2 Practical Benefit from Theoretical Guarantees

We conducted experiments with a simplified variant of our method that directly uses \(AE(\hat{f};\varTheta )\) as the final fODF estimate rather than regularizing deconvolution with it. This simplification sacrifices the formal guarantees that were discussed in Sect. 3.2 and thus helps us understand their practical benefit.

Table 2. For all autoencoders and test sets, the percentage of voxels for which L2 errors improve over standard CSD is greater when using the autoencoder output to regularize deconvolution, compared to using it directly as an estimate (“CSD+AE”).

We found that average L2 and angular errors of the two variants were very similar, and neither variant dominated the other. However, we do not only strive for high accuracy on average, but also for an improvement over the baseline in as many individual voxels as possible. As can be seen in Table 2, regularized deconvolution makes a clear contribution towards this goal, since it consistently increases the fraction of voxels for which an improvement over standard CSD is obtained.

This implies that we can often benefit from regularizing CSD with an fODF that has a higher error than the unregularized result, which may seem counter-intuitive. We observed that autoencoders tend to oversmooth the fODFs, sometimes to an extent that increases the L2 error beyond the baseline. In this case, the regularization better balances data fidelity with the level of smoothing.

4.3 Iterating Our Method

Since the image restoration works that inspired our approach [2, 5, 13] apply the autoencoder repeatedly within an iterative optimization, it was natural to attempt iterating Eq. (2), so that the result \(\tilde{f}\) from the previous iteration replaces the original estimate \(\hat{f}\) in the next one. However, this almost always increased L2 errors, which we found to be due to the onset of oversmoothing.

4.4 Qualitative Results

We also validated our method qualitatively by performing tractography on the resulting fODFs. Figure 2 provides an example from the brainstem region. A tractography on our reference dataset is in (a), while (b) and (c) are based on test set 3 with merely 15 DWIs. As expected from Table 1, our regularizer (c) cannot obtain a perfect reconstruction from this strong subsampling, but it does restore several fibers missing from the unregularized result, marked with ellipses in (b), and it eliminates some spurious fibers, marked with arrows.

Fig. 2. Compared to a tractography on our reference dataset (a), our regularizer (c) restores some plausible tracts missing from a standard reconstruction of test set 3 (ellipses in b), and it eliminates some false positives (arrows in b).

5 Conclusion

We have presented a novel method for estimating improved fiber ODFs from suboptimal diffusion MRI data, using fODF autoencoders to learn a suitable regularizer in an unsupervised manner. On independent test data, our regularizer allowed us to achieve similar angular accuracy from only 32 gradient directions as the standard approach achieves from 64 directions. Similarly, we achieved angular accuracy from 30 directions at \(b=700\) that is comparable to the standard approach with 32 directions at \(b=2000\). Our current regularizer only uses information from the same voxel; we hope to extend it to also account for spatial neighborhoods.

On a conceptual level, our work addresses more general challenges such as: How can we use machine learning to produce accurate results not just on average, but in as many individual cases as possible? (By imposing suitable constraints.) How can we enforce formal guarantees on the results of a learning based approach? (By combining it with constrained optimization.) How can we eliminate the need for frequent re-training? (By learning something more general, e.g., the natural distribution of fODFs rather than the mapping from subject- or scanner-specific data to fODFs.) In our future work, we plan to continue exploring strategies for these conceptual challenges, within and beyond this specific application.