1 Introduction

Cardiac motion analysis assesses regional deformation parameters such as volume output, strain and torsion, which are indicative for the diagnosis and treatment for patients with cardiovascular diseases [8, 9]. The deformation parameters can be derived from displacement field estimated from cardiac magnetic resonance (MR) images. Traditionally cardiac motion estimation is cast as a series of pairwise registration tasks. Shen et al. [8] extended a hierarchical attribute-matching based registration method to simultaneously estimate cardiac motion of all frames in a sequence by formulating cardiac motion as spatial-temporal 4D registration. Shi et al. [9] applied B-spline free-form deformation (FFD) registration [7] on both cine and tagged cardiac MR images by spatially weighting the complementary information from the two modalities.

Deep learning methods have been successfully applied to deformable registration, demonstrating competitive performances with significantly superior speed. Several methods that train deep convolutional neural networks (ConvNets) to perform one-shot prediction of the deformation between two images have been proposed. A critical difference in the proposed methods is the supervision signal used during training. On the one hand, networks are trained to perform a regression task using ground truth deformation that are acquired either via random simulation [3, 10] or traditional registration algorithms [2, 12]. These methods are termed supervised methods since the ground truth of the deformation is used in training. On the other hand, several recent unsupervised methods opt to directly optimise the parameters of the network to maximise intensity-based similarity for all image pairs in a training dataset [1, 11]. Most related to this work, [6] incorporated unsupervised registration method to provide complementary motion information for cardiac segmentation. Despite the advances of both supervised and unsupervised methods, it remains unclear which training strategy is more suitable for cardiac motion estimation.

In this work, we trained a deep learning registration network to perform cardiac motion estimation using both supervised and unsupervised training strategy, and compared the performances on both the accuracy and the regularity of the estimated motion. We show that the unsupervised model was able to extract motion that describes the deformation of anatomical structure more accurately, while the supervised model produced spatially smoother and more topology-preserving deformation.

2 Background

The objective of cardiac motion estimation is to determine the spatial transformation of cardiac structures over time. Let \(\{I_t\}_{t=0,1,2,...,N_T}\) represent a sequence of cardiac cine MR images where \(N_T\) is the total number of frames and let \(\mathbf {p}_0 \in \mathbb {R}^2\) denotes the position of a point on the first frame (\(t=0\)). We can determine the spatial transformation \(\mathcal {T}(\cdot )\) using image registration such that \(I_0(\mathbf {p}_0)\) and \(I_t(\mathcal {T}_t(\mathbf {p}_0))\) represent the same anatomical structure. The transformation can be described by a dense displacement field (DDF), denoted by \(\mathbf {u}_t\) where \(\mathbf {u}_t(\mathbf {p}_0) = \mathbf {p}_t - \mathbf {p}_0\).

Deep learning has been used to perform the registration with one-step prediction by modelling a complex function \(f_\theta (I_0, I_t) = \mathbf {u}_t\) that maps a pair of images to the optimal displacement field using convolutional neural network (ConvNet), where \(\theta \) is the parameters of the network. The parameters \(\theta \) in the registration network can be trained using two different supervision signals: ground truth DDF \(\mathbf {u}_{GT}\) (supervised), or the similarity between the pairs of images after registration (unsupervised).

3 Method

This paper adapts and compares two training strategies, supervised and unsupervised, for a deep learning based cardiac motion estimation in cine MR image sequences. The registration networks and the training strategies were set up in a comparable manner for a fair comparison. An overview of both the supervised and unsupervised registration frameworks is illustrated in Fig. 1.

Fig. 1.
figure 1

Supervised and unsupervised registration framework

3.1 Supervised Training

Ground Truth Deformation. The ground truth deformation is required for supervised training of the registration network. Existing deep learning methods for deformable registration usually generate the ground truth displacement \(\mathbf {u}_{GT}\) using traditional registration methods [2], and use the original image pair \(\{(I_0, I_t)\}\) as input to the network. The network sees the real image pairs during training but the ground truth deformation does not completely capture the transformation due to residual errors from the traditional registration methods used to estimate the ground truth deformation. Alternatively, the deformation field acquired from traditional registration can be used to deform image \(I_t\) to generate a pseudo-target image \(I_0'=I_t \circ \mathcal {T}_{\mathbf {u}_t}\). We then use the image pairs \(\{(I_0', I_t)\}\) as input to the network. The ground truth in this setting fully captures the deformation between the input image pair \((I_0', I_t)\) and thus is not limited by residual registration errors. These two variants of supervised training are compared in Sect. 4.2. B-spline FFD [7] is used for traditional registration.

Training. As shown in Fig. 1(a), the network predicts the DDF \(\mathbf {u}_t\) from each pair of input images. For cardiac motion estimation, a sequence of image pairs \(\{(I_0, I_t)\}_{t=1,2,3,...,N_T}\) is given as input to the network in one batch such that each training iteration optimises the group registration of the sequence [6, 8]. The end-diastolic (ED) frame is used as the first frame (or the target frame) and is repeated in each pair in the batch. To train the model, we use Mean Square Error (MSE) between the predicted and ground truth DDF as the regression loss:

$$\begin{aligned} \mathcal {L}_{supervised} = \frac{1}{N_T} \sum _{t=1}^{N_T} \left( \frac{1}{|\varOmega |} \sum _{\mathbf {p}\in \varOmega } (\mathbf {u}_t(\mathbf {p}) - \mathbf {u}_{GT}(\mathbf {p}))^2 \right) \end{aligned}$$
(1)

where \(N_T\) is the number of frames in one batch/sequence and \(\varOmega \) is the spatial domain of the images.

3.2 Unsupervised Training

As illustrated in Fig. 1(b), we use image intensity-based similarity as loss function with an additional regularisation on the predicted displacements. The loss function that the training minimises at each iteration is:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{MSE} + \lambda \mathcal {L}_{smooth} \end{aligned}$$
(2)

The first term in the loss function measures the pixel-wise difference between the target image and the registered source image:

$$\begin{aligned} \mathcal {L}_{MSE} = \frac{1}{N_T} \sum _{t=1}^{N_T} \left( \frac{1}{|\varOmega |} \sum _{\mathbf {p}\in \varOmega } (I_0'(\mathbf {p}) - I_0(\mathbf {p}))^2 \right) \end{aligned}$$
(3)

Here \(I_t\) is transformed to \(I_0'\) using differentiable bi-linear sampling in the spatial transformation network [4], enabling backpropagation for training. The second term in the loss function encourages spatially smooth deformation by minimising the variation of displacements using approximated Huber loss [6] on first-order spatial derivatives of \(\mathbf {u}_t\),

$$\begin{aligned} \mathcal {L}_{smooth} = \frac{1}{N_T} \sum _{t=1}^{N_T} \left( \frac{1}{|\varOmega |} \sum _{\mathbf {p}\in \varOmega } \sqrt{ \left| \frac{\partial \mathbf {u}_t(\mathbf {p})}{\partial x} \right| ^2 + \left| \frac{\partial \mathbf {u}_t(\mathbf {p})}{\partial y} \right| ^2 } \right) \end{aligned}$$
(4)

Similar to the supervised training, one sequence of image pairs from one cardiac sequence is used in each input batch. The weight \(\lambda \) of the smoothness regularisation loss is set to \(10^{-4}\) which is selected based on the performance on the validation dataset.

3.3 Network Architecture

A schematic of the network is shown in Fig. 2. The same network architecture is used in both training strategies and is adapted from the motion estimation branch of the joint segmentation and motion estimation framework proposed in [6]. The network employs two encoder branches with \(3\times 3\) convolutional kernels to extract features from the images. A stride of 2 is used every two convolutional layers to reduce the resolution of feature maps by 2 and increase the size of receptive field [1]. The features from all levels of the two encoders are concatenated before a convolution layer and upsampling to full resolution. Further convolutional layers are applied to fuse information from different scales before making the final prediction.

Fig. 2.
figure 2

Architecture of the registration network. The coloured blocks represent images or feature maps with the number of channels written inside. The resolution of the feature maps with respect to the input is written underneath the blocks. The final output has 2 channels encoding the displacement in 2 directions. (Color figure online)

4 Experiments

4.1 Set up

Data. The two training strategies are evaluated using short-axis view cardiac MR images of healthy subjects from the UK BioBank studyFootnote 1. Randomly selected image sequences of 120 subjects were used for training and validation with another 100 subjects used for testing. Each sequence contains temporally pre-aligned 2D stacks of images of 50 consecutive time points in a complete cardiac cycle. In-plane resolution of the images is \(1.8\times 1.8\,\mathrm{mm}\) per pixel while through-plane resolution is \(10\,\mathrm{mm}\) per pixel. The low resolution between planes could lead to physically implausible displacements of anatomical structure in 3D registration, which is the reason that our motion estimation is performed in 2D plane. The segmentation of the left-ventricular cavity (LV), myocardium wall (MYO) and right-ventricular cavity (RV) on the ED frame and the end-systolic (ES) frame is used to evaluate the accuracy of the estimated motion.

Metrics. The estimated cardiac motion is evaluated on both accuracy and smoothness. To evaluate the accuracy, we first estimate the motion between the ED frame and the ES frame. Then we apply the estimated motion to deform the segmentation mask of the ES frame towards the ED frame, and measure its overlap with the ground true ED frame segmentation using the Dice score and Hausdorff Distance (HD). HD is measured on the outer contours of the anatomical structures. To evaluate regularity of deformation, we calculate the determinant of the Jacobian matrix \(J_\phi (\mathbf {p})=\nabla \phi (\mathbf {p})\), or simply the Jacobian, where \(\phi \) denotes the transformation. We compute the percentage of points that exhibit non-diffeomorphic deformation, indicated by \(|J_\phi (\mathbf {p})|\le 0\). We also calculate magnitudes of the gradient of the Jacobian, i.e. \(|\nabla |J_\phi ||\) which is a second-order metric measuring the spatial smoothness of deformation [5].

Comparison. To ensure fairness of the comparison, the supervised and unsupervised model use exactly the same network architecture described in Sect. 3.3. Both models are trained using the same amount of data for the same number of iterations and tested on data of the same testing subjects. As a reference of performance, the traditional B-spline FFD registration algorithm is also evaluated on the same testing data. The FFD algorithm is set to use the sum squared difference (SSD) as dissimilarity measure and Bending Energy (BE) as regularisation [7]. A 3-level hierarchical multi-resolution approach is used where the spacing of B-spline control points on the highest resolution is set to \(8\,\mathrm{mm}\). The same setting of FFD was used to generate the ground truth deformation for supervised training. The regularisation weights in the unsupervised method and FFD introduce a trade-off between accuracy and deformation regularity, making the selection of these parameters for fair comparison non-trivial. In this paper, both regularisation weights were selected to maximise the accuracy performance on the validation dataset.

Implementation Details. Input images are pre-processed by cropping to the size of \(160\times 160\) so that the registration is focused on the region of interest. The intensity value of input images is normalised to [0, 1]. The deep learning registration networks were implemented in Pytorch and trained for 500 epochs on NVIDIA® GeForce® Titan Xp GPUs. The B-spline FFD registration was performed using the implementation in MIRTKFootnote 2. The runtime of FFD is measured on an Intel® CoreTM i7-8700 CPU.

4.2 Results

Table 1 shows the results of the accuracy and regularity of different methods. When comparing the results of different methods, the Wilcoxon signed-rank test is performed to assess the statistical significance. It can be observed that the unsupervised training outperforms (\(p\ll 0.001\)) supervised training in terms of accuracy especially on left ventricle and right ventricle, and on-par with B-spline FFD on most metrics. Between supervised models, the one trained using the \(\{(I_0', I_t)\}\) image pair (“sup+warp.”) performs similar to the one trained using the \(\{(I_0, I_t)\}\) image pair (“sup+orig.”) except better on myocardium measurements. In terms of regularity, the supervised methods produce deformations that are spatially smoother (lower \(|\nabla |J_\phi ||\)) and significantly less topology-altering (lower \(\% |J|\le 0\)).

Table 1. Accuracy and regularity of cardiac motion estimated by different methods. The accuracy metrics are also evaluated on unregistered input images (“Unreg”) as a reference. The mean and standard deviation over 100 testing subjects are presented. The best results with statistical significant advantage (\(p\ll 0.001\)) are highlighted in bold.

Figure 3 visually demonstrates the difference amongst different motion estimation methods on one exemplar subject. The deep learning model trained using supervised strategy performs inferior to its unsupervised counterpart. It can be observed, from the ED frame image reconstructed by deforming the ES frame image, that the supervised method significantly underestimates the deformation and produces some artefacts in the middle of the LV blood pool. The unsupervised method captures the deformation better but violates some topological structure especially around epicardial contour, as illustrated by the folding that can be observed on the deformed meshgrid.

Runtime Advantage. Despite not achieving significant performance advantage over the traditional method, the deep learning models are able to register a sequence of 50 2D frames in 80.05 ms whereas FFD takes 23.18 s. The runtimes are measured and averaged over 100 test subjects.

Fig. 3.
figure 3

Visualisation of motion estimation results. The target ED frame image \(I_{ED}\) is shown on the top left. The rest of the first row shows the ED frame reconstructed by deforming the ES frame image using deformation estimated by the various methods (\(I_{ED}'\)), overlaid by the meshgrid deformed using the same estimated deformation. The second row shows the error maps (with RMSE values) between the reconstructed image and the ED frame.

5 Conclusion and Discussion

In this work, we evaluated and compared the effect of different training strategy on the performance of deep learning registration network on the task of cardiac motion estimation. In terms of accuracy, we found that unsupervised training, which uses only image similarity, outperforms the supervised training strategies. This could be attributed to the fact that the unsupervised learning optimise directly on the image intensity difference, while the supervised training is either restricted by the registration error from FFD (“sup+orig.”) or the difference between the testing target images and training target images (“sup+warp”). Although performing inferior on accuracy, the supervised methods produce spatially smoother and more topology-preserving deformation.

The superior regularity of the supervised methods could be a result of inheriting spatial smoothness property from the B-spline basis functions in FFD and further regularisation in the regression. It is also possible that the better regularity can only be achieved while under-estimating deformation. This will be further investigated in the future. Future studies should also include a supervised model trained using randomly generated or permuted deformation ground truth in the comparison. This will help to understand the need for realistic ground truth deformation for the supervised method. Another limitation of the paper is that only two representative supervised and unsupervised designs are experimented whereas a study of more existing methods under the same experimental setting would be able to draw a more general conclusion.