1 Introduction

Image registration is the process of aligning two or more images of the same scene with respect to each other where the images are captured by different sensors, at different times and/or from different viewpoints [1]. It is an essential task for automated medical image analysis since the images are subject to unpredictable geometric distortions during the acquisition. Typically, an image is considered as the reference image and images which are to be aligned with respect to the reference image are referred to as the moving images. Image registration is broadly classified as: (i) rigid or parametric registration, and (ii) non-rigid or deformable registration [1]. In rigid registration, the algorithms typically depend on a small set of parameters, i.e., few matching points between moving and reference image, from affine group of transformations. The goal of such methods is to obtain optimal parameters describing the geometric mapping between the reference and moving image by maximizing a similarity metric. On the other hand, deformable image registration algorithms aim to estimate the underlying displacement vector field (DVF) considering the distortions across all the pixel locations in the moving image [2].

Recently, deep learning based registration techniques have found performing faster and are robust when compared to the classical approaches. Studies where deep learning based methods are used for image registration include convolutional stacked autoencoder [3] which is applied to LONI and ADNI brain MRI datasets. A Multi-resolution image registration framework which learns spatial transformation at different resolution in a self supervised fully convolutional network (FCN) is proposed in [4]. In [5], spatial transformer network to estimate the transformation parameters to align the image is proposed. In [6], FCN based techniques are proposed as Deformable Image Registration network (DIRnet). Image registration is also attempted by deep generative model in [7]. More recently, Voxelmorph is proposed where the network is trained with Brain MRI dataset, while incorporating auxiliary information such as segmentation in order to ease the registration task [8]. Note that, all this conventional deep learning models require substantial amount of data examples for the supervised training and may also prone to underfitting/overfitting. Moreover, once such CNN is trained, it relies on its fixed set of weights to predict the output of the test image. However, there is high chance of underfitting and uncertainity involved in medical data due to scarcity of data examples and unpredictable geometric distortions, respectively. To overcome these issues, in this paper, we propose a novel Bayesian deep learning architecture for achieving deformable medical image registration. Unlike existing approaches, our architecture learns the posterior distribution over the weights with limited examples by the backpropagation. The network then learns the DVF by analyzing reference and moving image pair. In turn the DVF together with moving image is used to perform grid resampling in order to register it with the reference coordinates.

2 Proposed Architecture

In this section, we discuss the proposed Bayesian deep architecture to perform deformable image registration of medical data. Given a pair of reference and moving image our objective is to register the moving image onto the reference image coordinates while maintaining the pixel distribution of the moving image.

In the conventional convolutional neural network (CNN) set-up, given a set of training examples \(\mathbf D =(\mathbf x _{i},\mathbf y _{i})_{i} \) where \(\mathbf x _{i}\) and \(\mathbf y _{i}\) are the \(i^{th}\) input and output respectively, the goal is to learn weights w that shall predict the output \(\mathbf y _\mathbf{test }\) corresponding to test input \(\mathbf x _\mathbf{test }\). The training is typically done by the backpropagation, where one assume that \(\log P(\mathbf D |\mathbf w )\) is differentiable in w. Note that, at the end of training, the estimated weights w are fixed real numbers. Hence, the network has to rely on these hard (fixed) weights (numbers) to predict the output which may lead to severe error especially in medical data.

Unlike the conventional CNNs, Bayesian neural networks estimate the posterior distribution of the weights given the training data \(P(\mathbf w |D)\) by backpropagation [9]. The predictive distribution of output \(P(\mathbf y _\mathbf{test })\) of a test data input item \(\mathbf x _\mathbf{test }\) is then given by

$$\begin{aligned} P(\mathbf{y _\mathbf{test }}|\mathbf x _\mathbf{test }) = {{\,\mathrm{\mathbb {E}}\,}}_{P(\mathbf w |\mathbf D )}[P(\mathbf y _\mathbf{test }|\mathbf x _\mathbf{test },\mathbf w )], \end{aligned}$$
(1)

where \({{\,\mathrm{\mathbb {E}}\,}}_{P(\mathbf w |\mathbf D )}\) is conditional expectation over posterior distribution of weights given the training data D. Thus taking the conditional expectation under the posterior distribution on weights (Eq. (1)) is equivalent to using a large number of neural networks for learning the weights. We propose one such deep Bayesian architecture Fig. 1 for achieving deformable medical image registration.

Fig. 1.
figure 1

Proposed deep Bayesian architecture for deformable medical image registration.

As shown in Fig. 1 a pair of reference and moving image of size \(150 \times 150\) pixels concatenated into 2-channel 2D image \((150 \times 150 \times 2)\) is then passed through downsampling and upsampling layers. It consists of 5 downsampling and 4 upsampling layers. Each downsampling layer consists of two convolutional layers with filters of size \(3\times 3\) where stride is 1 and pooling layer with a stride of 2. Whereas each upsampling layer consists of bilinear interpolation with scale size of 2 followed by two convolutional layers with filters of size \(3 \times 3\) where stride is 1. In order to incorporate non-linearity, we have used rectified linear units (ReLU) and tanh as activation functions. As shown in Fig. 1, the output of the architecture is the DVF for the moving images. It estimates the geometric displacement required by each pixel coordinate of moving image in order to get it registered with respect to the reference image. The estimated DVF along with grid resampling and bilinear interpolation of moving image gives us the final registered image as shown in Fig. 1.

2.1 Training/Test

The supervised training of proposed approach (Fig. 1) is done as follows: we use a differentiable loss function L that is a combination of structural similarity index (SSIM) [10] and mean-squared error (MSE) [11],

$$\begin{aligned} \text {L} (\mathbf I _\mathbf{ref },\mathbf I _\mathbf{reg }) = \text {SSIM} (\mathbf I _\mathbf{ref }, \mathbf I _\mathbf{reg }) + \text {MSE} (\mathbf I _\mathbf{ref }, \mathbf I _\mathbf{reg }), \end{aligned}$$
(2)

where \(\mathbf I _\mathbf{ref }\) is reference and \(\mathbf I _\mathbf{reg } \) is registered image, i.e., (\(\mathbf I _\mathbf{mov } \circ \varvec{\psi }\) ) where \(\varvec{\psi }\) is DVF, and

$$\begin{aligned} \text {MSE} (\mathbf I _\mathbf{ref }, \mathbf I _\mathbf{reg }) = \frac{1}{mn}\sum _{x=1}^{m}\sum _{y=1}^{n}[\mathbf I _\mathbf{ref }(x,y) -\mathbf I _\mathbf{reg }(x,y)]^{2}, \end{aligned}$$
(3)
$$\begin{aligned} \text {SSIM}(\mathbf I _\mathbf{ref },\mathbf I _\mathbf{reg }) = \frac{(2\mu _{I_{ref}}\mu _{I_{reg}} + C_{1})(2\sigma _{I_{ref}I_{reg}} +C_{2} )}{(\mu _{I_{ref}}^{2} +\mu _{I_{reg}}^{2} +C_{1})(\sigma _{I_{ref}}^{2}+\sigma _{I_{reg}}^{2} +C_{2})} , \end{aligned}$$
(4)

where \(\mathbf I _\mathbf{ref }(x, y)\) and \(\mathbf I _\mathbf{reg }(x, y)\) are reference and registered image coordinates, \(\mu _{I_{ref}}\) and \(\mu _{I_{reg}}\) are means and \(\sigma _{I_{ref}}^{2}\) and \(\sigma _{I_{reg}}^{2}\) are variances of reference and registered image, respectively. Here \(C_{1}\) and \(C_{2}\) are constants, mn are total number of pixels for images. Given the differentiable loss function (2), the weights w = \([\text {w}_{1}, \text {w}_{2},....,\text {w}_{n}] \) and bias \(\mathbf b =[\text {b}_{1}, \text {b}_{2},....,\text {b}_{m}] \) where \(\text {w}_{i}\sim \mathcal {N}(\mu _{w},\sigma _{w})\), \(\text {b}_{i}\sim \mathcal {N}(\mu _{b},\sigma _{b})\) [12] of each layers are updated using backpropagation. We have employed batch normalization after every convolution layer, learning rate of 0.001, and batch size of 1 throughout training process.

3 Experiments and Results

In this section, we discuss the results obtained by applying the proposed Bayesian deep learning approach for the deformable medical image registration and compared with recent state-of-the-art approaches. In our experiment, we have used publicly available lungs computerized tomography (CT) scan data from the deformable image registration laboratory [13] as well as cardiac magnetic resonance imaging (MRI) data [14]. We implement the algorithm in Python and pytorch 1.0 on a machine with Intel core i5, Nvidia Geforce GTX 1050 4GB with 24 GB RAM.

Fig. 2.
figure 2

A few challenging cases in lungs CT dataset [13]. (a) reference images, (b) moving images, (c) estimated DVF, and (d) corresponding registered images using proposed Bayesian deep architecture.

Fig. 3.
figure 3

Comparative results on the challenging cases in Lungs CT data [13]. (a) and (b) are reference and moving images, respectively, (c) DIRnet [6], and (d) proposed.

3.1 Results on Lung CT Dataset

In this subsection, we show results of our approach on the lung CT images [13] and compare it with state-of-the-art approaches. The dataset consists of Thoracic 4DCT images as well as inspiratory and exipratory breath-hold CT image pairs. Entire dataset is arranged in 10 different cases. The images are of size \(512 \times 512 \times 128\) voxels with a dimension of each voxel \(0.97 \times 0.97 \times 2.5\) mm. In this experiment we have used 15 slices out of them, and resized it to 150 \(\times \) 150 pixels i.e, it yields \(150 \times 150 \times 15\). In order to generate moving images with different degrees of geometric distortions, we applied deformable or elastic distortion of order 0.8. We have generated 1000 distorted images for a single slice, giving a total of 15000 image pairs where the original slice (image) is considered as the reference and the corresponding distorted images are treated as its moving images. For the validation purpose, we split the dataset into training and test of 5000 and 3000 pairs, respectively. We first show the visual results of few challenging cases in Fig. 2. We now compare the reference image with estimated registered images using our approach as well as state-of-the-art approaches with different similarity metrics i.e, normalized cross correlation, structural similarity index (SSIM) [10], Hausdorff distance (HD) [15], root mean-squared error (RMSE) [11] for the test data of 3000 image pairs. We first show visual comparisons in Fig. 3. The calculated average values are as listed in the Table 1 along with results using state-of-the-art registration frameworks such as selfsupervised FCN [4], DIRnet [6], and Voxelmorph [8]. Note that, \(\epsilon ^{100}\) and \(\epsilon ^{1000}\) are the means of 100 and 1000 samples from the posterior distribution. Since we are taking the mean of samples, the proposed approach with \(\epsilon ^{100}\) and \(\epsilon ^{1000}\) have shown better results when compared to existing state-of-the-art approaches.

Table 1. Results on Lung CT dataset.

3.2 Results on Cardiac Dataset

In this subsection, we discuss results of our approach on the cardiac MRI images [14] and compare it with state-of-the-art approaches. The dataset consists of short axis cardiac MR images and the ground truth of their left ventricles endocardial and epicardial segmentations. The cardiac magnetic resonance images are acquired from 33 subjects. Each subject’s sequence consist of 20 time points and 8–15 slices along the long axis.

Fig. 4.
figure 4

A few challenging cases in cardiac MRI dataset [14]. (a) reference images, (b) moving images, (c) estimated DVF and (d) corresponding registered images using proposed deep Bayesian architecture.

Fig. 5.
figure 5

Comparative results on the challenging cases of cardiac MRI results [14]. (a) and (b) are reference and moving images, respectively, (c) DIRnet [6], and (d) proposed.

Table 2. Results on cardiac MRI dataset.

We considered slices of \(256 \times 256\) voxels of one subject and resized it to \(150 \times 150\) pixels. In order to generate moving images with different degrees of geometric distortions, we applied deformable or elastic distortion of order 0.8. We have generated 1000 distorted images for a single slice giving a total of 10000 image pairs. For the validation purpose, we split the dataset into training and test sets of size 5000 and 3000 image pairs, respectively. We first show the visual results of few challenging cases as shown in Fig. 4 and then compare the reference image with estimated registered images using our approach as well as recent state-of-the-art approaches for the test data of 3000 image pairs. We show visual comparison results in Fig. 5 while calculated average values of similarity metrics are as listed in the Table 2.

4 Conclusion

We propose Bayesian deep learning approach for deformable medical image registration. Unlike existing deep learning frameworks, the proposed approach learns posterior distribution over the weights by backpropagation. Experimental results demonstrate the efficacy of proposed architecture when compared to state-of-the-art approaches.