
1 Introduction

Image registration is one of the most important tasks in many medical image processing applications, e.g. for atlas-based segmentation, motion analysis, or monitoring of growth processes. Consequently, a variety of non-linear registration approaches has been proposed over the past three decades [15].

Inspired by the remarkable success of convolutional neural networks (CNNs) for image classification, a number of CNN-based approaches have been proposed to tackle image registration and optical flow problems, mostly in computer vision. One line of research integrates CNN-based correspondence matching into registration/optical flow methods [11, 16], while others have successfully learned similarity metrics [14]. Recently, Dosovitskiy et al. [4] rephrased the dense optical flow problem in computer vision as a regression task, learned by CNNs in an end-to-end manner. Their CNN (FlowNet) is able to estimate dense deformation fields from pairs of 2D images at high frame rates and with competitive accuracy.

The success of CNNs for classification tasks heavily relies on the availability of large annotated training populations. However, for real-world image registration problems, dense ground truth correspondences are rarely available and their manual generation is usually infeasible. In computer vision [4], this problem is overcome by generating synthetic data sets using 3D CAD models and (photorealistic) rendering. This approach is difficult to transfer to the medical field, where the limited availability of training images of a given kind poses an even bigger challenge. This paper addresses two problems in training CNNs for image registration tasks: missing ground truth correspondences and the small number of available training images. We aim to generate a large and diverse set of training image pairs with known correspondences from few sample images.

The usual approach to cope with few training samples is data augmentation. A discussion and comparison of augmentation techniques for shape modeling is given in [17]. In the context of machine learning, data augmentation aims to enforce invariance of a learner to certain geometric deformations or appearance features by applying random transformations to the samples during the learning process and, hence, to improve its generalization abilities. This is a key aspect of performance improvements in recent classification and segmentation systems [2, 12]. Most data augmentation schemes are manually specified, i.e. a set of geometry and intensity transformations is defined for which the task at hand is believed to be invariant, e.g. affine transformations, noise, and global changes in brightness; see e.g. [4, 7]. To learn invariance related to elastic distortions, so far mostly unspecific random deformations are applied (e.g. in U-Net [12]). Only few data-driven augmentation techniques with transformations learned from the training data exist [6, 10]. For example, in [6], non-linear transformations are learned to estimate probabilistic class-specific deformation models.

The absence of sufficiently large training populations and the unspecific nature of currently available data augmentation approaches prevent the use of CNN-based image registration approaches like FlowNet for medical applications. We, therefore, propose a novel approach for learning representative shape and appearance models from few training samples, and embed it in a new model-based data augmentation scheme to generate huge amounts of ground truth data. Compared to [12], this allows us to synthesize more specific data, and in contrast to [6], our approach also seamlessly integrates appearance-related data augmentation.

The contribution of this paper is threefold: (1) A recent approach for shape modeling from few training samples [17] is extended to appearance modeling. (2) We show that this approach can be used to synthesize huge amounts of realistic ground truth training data for CNN-based medical image registration. (3) We adapt the FlowNet architecture to two medical image registration problems and show its potential to outperform state-of-the-art registration methods.

Fig. 1. Overview of the proposed model-based data augmentation approach.

2 Methods

The training of CNNs requires huge amounts of training data; e.g. in [4], \({\sim }22000\) image pairs with dense ground truth are used to train FlowNet. Thus, the central goal of our approach is to generate many pairs of synthetic (but realistic) images \((\tilde{I}_i,\tilde{I}_j)\) with associated ground truth deformations \(\phi _{i\rightarrow j}\), i.e. \(\tilde{I}_j\approx \tilde{I}_i\circ \phi _{i\rightarrow j}\), from few real samples. Basically, our approach learns a statistical appearance model (SAM) [3] from the available training images and applies this model to synthesize an arbitrary number of new images with varying object shape and appearance (see Fig. 1). A common problem of classical SAMs is their limited expressiveness, as the dimension of the model space is usually restricted by the number of available training images. Therefore, our appearance model adapts a recently published approach for building representative statistical shape models (SSMs) from few training data [17]. This allows us to generate highly flexible SAMs from few real samples. We begin by briefly describing SAMs, followed by our adaptation of the approach presented in [17].

2.1 Statistical Appearance Models

We are given a set of n training images \(I_1,\ldots ,I_n\); \(I_i:\varOmega \rightarrow \mathbb {R}\), \(\varOmega \subset \mathbb {R}^2\), and for each image \(I_i\) a set of m landmarks \(\varvec{s}_i=[s_{i,1},\ldots ,s_{i,m}]^T\in \mathbb {R}^{2m}\) with \(s_{i,r}=[x_{i,r},y_{i,r}]^T\). These landmarks describe the shape of the object(s) of interest and are assumed to be in correspondence across the population and normalized using Procrustes analysis [3]. To generate the shape model from the shape vectors \(\varvec{s}_i\), the mean shape \(\varvec{s}_0\) and the orthonormal shape basis \(\mathbf {P_S}=(\varvec{p}_1|\ldots |\varvec{p}_k)\) given by the first \(k<n\) eigenvectors of the data covariance matrix \(\mathbf {C_S}\) are calculated. New shapes can now be generated using normally distributed shape parameters \(w_j\sim N(0,\lambda _j)\) with the variance \(\lambda _j\) equal to the corresponding eigenvalue:

$$\begin{aligned} \varvec{\hat{s}}=\varvec{s}_0+\sum _{j=1}^k w_j\varvec{p}_j. \end{aligned}$$
(1)
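To make this concrete, the following numpy sketch builds the shape basis by PCA and samples new shapes via Eq. (1). It is an illustration under our own naming, not code from the paper; `shapes` is assumed to be the \(n\times 2m\) matrix of Procrustes-aligned landmark vectors.

```python
import numpy as np

def build_ssm(shapes, k):
    """PCA on aligned shape vectors: mean shape, k eigenvectors, eigenvalues."""
    s0 = shapes.mean(axis=0)
    C = np.cov(shapes, rowvar=False)                 # (2m, 2m) covariance C_S
    lam, P = np.linalg.eigh(C)                       # eigenvalues, ascending order
    lam = np.clip(lam[::-1][:k], 0.0, None)          # k largest, guard tiny negatives
    P = P[:, ::-1][:, :k]                            # corresponding eigenvectors
    return s0, P, lam

def sample_shape(s0, P, lam, rng):
    """Draw w_j ~ N(0, lambda_j) and evaluate Eq. (1)."""
    w = rng.normal(0.0, np.sqrt(lam))                # one coefficient per mode
    return s0 + P @ w                                # s_hat = s_0 + sum_j w_j p_j

# usage (illustrative):
# s0, P, lam = build_ssm(shapes, k=10)
# s_hat = sample_shape(s0, P, lam, np.random.default_rng(0))
```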

The appearance model of a SAM is defined with respect to the mean shape \(\varvec{s}_0\), i.e. each training image \(I_i\) is shape-normalized by warping the shape vector \(\varvec{s}_i\) to \(\varvec{s}_0\). We use a multi-level B-spline scattered data approximation [8] to define the non-linear warp \(\varphi _i\) and choose the number of levels such that \(\max _r\Vert s_{0,r} - \varphi _i(s_{i,r})\Vert <\epsilon \). In our experiments, this approach leads to visually more realistic deformations than thin-plate splines [3] or piecewise-affine warps [9]. The appearance covariance matrix \(\mathbf {C_A}\) is computed from the shape-normalized images \(I_i\circ \varphi _i\) sampled at positions \(\varvec{x}_j\in \varOmega _0\). A PCA results in a mean image \(I_0\) and eigenimages \(\mathbf {P_A}=(A_1|\ldots |A_\kappa )\) defining the appearance model \(\hat{I}=I_0+\sum _{j=1}^\kappa \gamma _j A_j\). Again, the appearance parameters \(\gamma _j\) are assumed to be normally distributed, and we can generate new image instances by (1) sampling shape parameters to define the shape \(\varvec{\hat{s}}\) and calculating the inverse warping function \(\hat{\varphi }^{-1}\), (2) sampling appearance parameters to generate \(\hat{I}\), and (3) warping the image \(\tilde{I}=\hat{I}\circ \hat{\varphi }^{-1}\) (see the sketch below). However, SAMs strongly suffer from the high-dimension-low-sample-size (HDLSS) problem because the dimension of the embedding space is high (\(\sim \) number of pixels and landmarks) compared to the number of training images. This results in a limited generalization ability and thus hampers their applicability in the intended deep learning scenario.
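The three generation steps can be sketched as follows. Note that the paper fits \(\hat{\varphi }^{-1}\) with a multi-level B-spline scattered data approximation [8]; in this illustrative snippet, a thin-plate-spline interpolator from scipy stands in for that step, and all names are our own.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def synthesize_image(I0, PA, var_gamma, s0_pts, s_hat_pts, shape, rng):
    # step (2): sample appearance parameters, I_hat = I0 + sum_j gamma_j A_j
    gamma = rng.normal(0.0, np.sqrt(var_gamma))
    I_hat = (I0 + PA @ gamma).reshape(shape)
    # steps (1)+(3): fit the inverse warp from the sampled landmarks s_hat
    # back to the mean-shape landmarks s0 and pull intensities along it
    inv_warp = RBFInterpolator(s_hat_pts, s0_pts, kernel='thin_plate_spline')
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    pix = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
    src = inv_warp(pix)                      # coordinates in mean-shape space
    vals = map_coordinates(I_hat, [src[:, 1], src[:, 0]], order=1)
    return vals.reshape(shape)               # I_tilde = I_hat o phi_hat^{-1}
```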

2.2 Locality-based Statistical Shape and Appearance Models

Recently, a new approach to tackle the HDLSS problem of SSMs was proposed [17]. This locality-based approach assumes that local shape variations have limited effects in distant areas. To measure the distance \(dist(\cdot ,\cdot )\) between landmarks, simple Euclidean or geodesic contour distances can be used, but more elaborate distances incorporating prior knowledge and multi-object scenarios are also possible (see [17]). To enforce the locality assumption during model generation, a distance threshold \(\tau \) is defined and the correlation of distant landmark positions \(\bar{s}_{i},\bar{s}_{j}\) of the mean shape is set to zero:

$$\begin{aligned} \varvec{R}_\tau =\left\{ \rho _{i,j}\right\} \quad \text {with } \rho _{i,j}={\left\{ \begin{array}{ll} \frac{\mathrm {cov}(\bar{s}_{i},\bar{s}_{j})}{\sigma _i \sigma _j}, &{} \text {if } dist(\bar{s}_{i},\bar{s}_{j}) < \tau \\ 0, &{} \text {else} \end{array}\right. }\ . \end{aligned}$$
(2)

Here, \(\varvec{R}_\tau \) denotes a correlation matrix related to the modified covariance matrix \(\varvec{C}_\tau =(diag(\varvec{C}))^{1/2}\varvec{R}_\tau (diag(\varvec{C}))^{1/2}\). Finally, the eigenvectors of \(\varvec{C}_\tau \) form a new shape basis \(\mathbf {P}_\tau \). For small thresholds \(\tau \), each eigenvector tends to reflect only local shape variations present in the training set, and because \(rank(\varvec{C}_\tau )\gg rank(\varvec{C})\), a large number \(k>n\) of eigenvectors can now be selected for shape modeling in Eq. (1). Manipulating the correlation matrix (instead of directly changing the covariances) preserves the total variability of the training set.
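A minimal numpy sketch of this construction (our illustration, not the paper's code): zero the correlations of points farther apart than \(\tau \) and rebuild the covariance. For the appearance model of Sect. 2.2, the same idea applies with pixel positions \(\varvec{x}_j\) and thresholds \(\vartheta _i\), then without the 2\(\times \)2 block expansion used here for the (x, y) landmark coordinates.

```python
import numpy as np
from scipy.spatial.distance import cdist

def locality_covariance(C, pts, tau):
    """Eq. (2): C is the (2m, 2m) landmark covariance, pts holds the m
    mean-shape landmark positions, tau is the locality threshold."""
    sigma = np.sqrt(np.diag(C))
    R = C / np.outer(sigma, sigma)                # correlation matrix
    near = cdist(pts, pts) < tau                  # (m, m) distance mask
    mask = np.kron(near, np.ones((2, 2), bool))   # expand to x/y entries
    R_tau = np.where(mask, R, 0.0)
    return np.outer(sigma, sigma) * R_tau         # C_tau

# eigenvectors of C_tau give the new basis P_tau; k > n modes become usable:
# lam_tau, P_tau = np.linalg.eigh(locality_covariance(C, pts, tau))
```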

By selecting a set of thresholds \(\tau _1>\tau _2>\ldots >\tau _l\), a single multi-level shape model can be built that incorporates shape variations at different levels of locality. Let \(span(\mathbf {P}_{\tau _1})=\mathcal {P}_1\in \mathcal {G}(k_1,2m)\) and \(span(\mathbf {P}_{\tau _2})=\mathcal {P}_2\in \mathcal {G}(k_2,2m)\) be the subspaces of two locality models, where \(\mathcal {G}(k_i,2m)\) denotes a Grassmann manifold. Then the \(k_2\)-dimensional subspace nearest to \(\mathcal {P}_2\) that contains \(\mathcal {P}_1\) is sought (\(k_2\ge k_1\)):

$$\begin{aligned} \mathcal {P}_{1+2} =\arg \min _{\mathcal {P}\in \mathcal {G}(k_2,2m)} d_{\mathcal {G}(k_2,2m)}(\mathcal {P},\mathcal {P}_2)\quad \text {subject to } \mathcal {P}_1\subseteq \mathcal {P}\ . \end{aligned}$$
(3)

Here, \(d_{\mathcal {G}(k_2,2m)}(\cdot ,\cdot )\) denotes a geodesic distance between subspaces. The basis vectors of \(\mathcal {P}_{1+2}\) and the associated eigenvalues can be efficiently computed as shown in [17]. By successively solving Eq. (3) for the remaining levels of locality \(\tau _3,\ldots ,\tau _l\), a subspace \(\mathcal {P}_{1+2+\ldots +l}\) is found that includes global as well as very local shape variations (see [17] for details).
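As an illustration of Eq. (3) only (the paper's efficient algorithm, including the eigenvalue bookkeeping, is given in [17]): under our reading, the minimizer can be assembled from \(\mathcal {P}_1\) plus the leading principal directions of \(\mathbf {P}_{\tau _2}\) projected onto the orthogonal complement of \(\mathcal {P}_1\). The sketch below follows that reading; it is not code from the paper.

```python
import numpy as np

def combine_subspaces(P1, P2):
    """Illustrative solver for Eq. (3): P1 (2m, k1) and P2 (2m, k2) are
    orthonormal bases with k2 >= k1; returns a basis of P_{1+2}."""
    k1, k2 = P1.shape[1], P2.shape[1]
    resid = P2 - P1 @ (P1.T @ P2)       # project P2 onto the complement of P1
    U, _, _ = np.linalg.svd(resid, full_matrices=False)
    return np.hstack([P1, U[:, :k2 - k1]])  # P1 plus k2-k1 principal directions
```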

In [17], this locality-based approach is only defined for SSMs. Here, we extend it to appearance models by using the Euclidean distance between sampling positions \(\varvec{x}_j\in \varOmega _0\) in the image plane and associated thresholds \(\vartheta _1>\vartheta _2>\ldots \) to enforce uncorrelated image intensities in Eq. (2). To define the thresholds for multiple resolution levels, we found \(\vartheta _1=\max _{i,j}\Vert \varvec{x}_{i} - \varvec{x}_{j}\Vert \), \(\vartheta _i=\frac{\vartheta _{i-1}}{2}\) to be a reasonable choice, where the number of levels depends on the required locality.

2.3 Model-based Data Augmentation for Learning Image Registration

The locality-based shape and appearance model defined in Sect. 2.2 elegantly combines global and local variabilities in a single model, described by shape vectors, eigenimages, and the associated eigenvalues. Assuming Gaussian distributions for the shape and appearance parameters, we can directly apply the method described in Sect. 2.1 to generate new random images. The shape vectors \(\varvec{\hat{s}}_i\) and \(\varvec{\hat{s}}_j\) associated with the random samples \(\tilde{I}_i\) and \(\tilde{I}_j\) are used to compute the dense deformation \(\phi _{i\rightarrow j}\) by a multi-level B-spline approximation [8], see Sect. 2.1. Clearly, the accuracy of this deformation decreases with increasing distance from landmarks, which will be discussed in Sect. 3.
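A sketch of the pair generation, again with a thin-plate spline standing in for the multi-level B-spline approximation [8] and with illustrative names: two sampled shapes yield the dense deformation such that \(\tilde{I}_j(\varvec{x})\approx \tilde{I}_i(\phi _{i\rightarrow j}(\varvec{x}))\) at each pixel \(\varvec{x}\).

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def ground_truth_deformation(s_i_pts, s_j_pts, shape):
    """Fit phi_{i->j} from landmark correspondences: it maps pixel positions
    of sample j to the corresponding positions in sample i."""
    phi = RBFInterpolator(s_j_pts, s_i_pts, kernel='thin_plate_spline')
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    pix = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
    disp = phi(pix) - pix                         # dense displacement field
    return disp.reshape(shape[0], shape[1], 2)    # ground truth for training
```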

Fig. 2. Exemplary illustration of both data sets and the data augmentation approaches. Top row: LPBA40 brain data with ground truth labels (1st image). Bottom row: image pairs of the cardiac MRI data with overlaid deformations generated by the data augmentation approaches (random deformations and the novel model-based approach).

3 Experiments and Results

Data. To our knowledge, dense 3D registration with CNNs is currently computationally infeasible, and we, therefore, use two 2D inter-patient registration problems for our evaluation.

Brain MRI: We extract corresponding transversal slices from affinely pre-registered image volumes of 40 patients of the LPBA40 data set [13]; see Fig. 2 for examples. For each 2D image, 100 landmarks on the brain contour and 12 inner brain landmarks are defined for shape modeling. The average Jaccard overlap of 20 brain structures is used to assess the registration accuracy.

Cardiac MRI: We extract end-diastolic mid-ventricular short-axis slices from 32 cine MRI images [1]. Shape correspondences are defined by 104 landmarks located on the left ventricle (LV) epicardium and the right and left ventricle (RV/LV) endocardium. For the evaluation, we compute average symmetric contour distances for the RV/LV endocardium and LV epicardium contours.

Experimental setup. There are only a few approaches for CNN-based end-to-end training of dense image registration, and currently, FlowNet [4] is the best known among these. Therefore, the pre-trained FlowNet-S is used as the starting point for all CNN experiments, followed by fine-tuning with ground truth image pairs generated as detailed below. We adapted the data augmentation steps included in the FlowNet architecture to fit our image data (e.g. by removing color manipulations). The general goal of the three experiments conducted is to investigate our initial hypotheses that (1) fast CNN-based registration can achieve competitive accuracy on medical data given sufficient training data, and that (2) the proposed data-driven, model-based augmentation approach outperforms available generic, but highly unspecific methods.

FlowNet-Reg: In this experiment, we define ground truth deformation fields by an affine landmark-based pre-registration followed by a diffeomorphic variational non-linear registration of all training image pairs. Pairwise registration results in \(n(n-1)\) image pairs, which might not be sufficient for training if n is small. The chosen registration method is freely available in the ITK framework and among the best-performing methods on LPBA40 (see [5] for parameters).

FlowNet-Random: Dense smooth random deformations, as suggested in [12], are applied to all training images and combined with smooth local brightness changes. With this approach, an arbitrary number of image pairs with known ground truth can be generated, but both images of each pair are deformed versions of the same input image (see Fig. 2) and the deformations are unspecific.

FlowNet-SAM: The proposed locality-based shape and appearance model (see Sect. 2.2) is applied to generate image pairs and corresponding ground truth deformations as detailed in Sect. 2.3. The multi-object distance suggested in [17] with 4 (Brain)/3 (Cardiac) levels of locality is used for SSM generation.

The accuracy of the multi-level B-spline deformations used to infer dense displacement fields from landmark correspondences in Sect. 2.1 decreases far away from the landmarks, which results in a blurred appearance model in these regions, as visible in Fig. 2. One solution would be to spread landmarks over the whole image region; however, this is impractical in many applications. Instead, we adapt FlowNet and use a weighted loss function during training, with weights of 1 inside the objects (e.g. the heart) that decrease to 0 far away from the contour (see the sketch below).
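A minimal sketch of this weighting; the exact decay profile is not specified in the paper, so a linear falloff over a hypothetical `falloff_px` radius is assumed here, and the loss is written as plain numpy rather than the actual FlowNet training code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def weight_map(object_mask, falloff_px=32):
    """1 inside the object (boolean mask), decaying linearly to 0 outside."""
    d = distance_transform_edt(~object_mask)   # distance to the object
    return np.clip(1.0 - d / falloff_px, 0.0, 1.0)

def weighted_epe(flow_pred, flow_gt, weights):
    """Weighted mean endpoint error for (H, W, 2) flow fields."""
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return (weights * epe).sum() / weights.sum()
```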

Results. A 5-fold cross-validation is applied for all experiments on both image data sets. To compute a baseline accuracy, variational registration is applied to the test data without any landmark information for the brain images and using heart ROI masks for the cardiac data. Note that cardiac inter-patient registration is very challenging for intensity-based registration methods due to the large anatomical variations between patients (see Fig. 2). The results are summarized in Table 1 and show that FlowNet trained with model-generated data (FlowNet-SAM) outperforms all other methods with high significance (paired t-test, \(p<0.001\); for the brain images, \(p<0.01\)). The registration of one image pair (\(256\times 256\)) takes 0.05 s on the GPU. FlowNet-Random and FlowNet-SAM were trained with approximately 10,000 samples each, which in our experiments was found to be a lower bound. The Jaccard coefficients for the brain scenario of the registration method and FlowNet-SAM are comparable to the 3D values of state-of-the-art methods [5]. Interestingly, for the difficult cardiac registration problem (see VarReg results), pre-trained FlowNet fails, which might suggest that the filters learned on the synthetic chair data (see [4]) are of no use in this scenario. Fine-tuning with the proposed approach, however, greatly improves the results. As assumed, fine-tuning with random deformations does not provide much meaningful information for medical data, resulting in poor registration accuracy.

Table 1. Results of the experiments on both data sets. Given are mean Jaccard coefficients (Brain) and mean contour distances in mm (Cardiac) over the 5 folds with respect to the ground truth segmentations/landmarks. Shown is FlowNet trained on four data sets. Pre-trained: trained on synthetic chair data (see [4]); Reg: fine-tuned on VarReg deformations of the training data; Random: random deformations; SAM: data augmentation using the proposed models. Note the different numbers of training samples (2nd, 4th column). Superscripts indicate statistically significant differences to FlowNet-SAM (\(\diamond \): \(p<0.01\), \(\star \): \(p<0.001\)).

4 Discussion and Conclusion

In this work, we propose the use of CNN-based image registration for medical image data and present a novel model-based data augmentation scheme that allows for deep learning on small training populations. The results of our evaluation confirm our initial hypotheses that CNN-based registration can achieve competitive accuracy on medical data and that the proposed model-based augmentation approach outperforms unspecific augmentation schemes. We furthermore show that simple but specific fine-tuning of the FlowNet architecture, designed for and pre-trained on completely different data, gives surprisingly good results. We, therefore, strongly believe that CNN-based image registration has the potential to outperform state-of-the-art medical image registration methods in the future. Currently, FlowNet is limited to 2D registration problems. However, this limitation does not apply to the proposed data augmentation approach, which readily generalizes to 3D.