1 Introduction

Because different imaging modalities (e.g., CT images, T1- and T2-weighted MR images) contain complementary information, multi-modal images are often acquired and fused for disease diagnosis, treatment planning, etc. However, acquiring multi-modal images can be time-consuming and costly. Furthermore, the fusion typically requires accurate cross-modality registration and can be degraded by organ deformation.

Cross-modality synthesis is thus valuable for both scientific research and clinical applications. Although each modality presents different characteristics of the underlying anatomy, the modalities are highly correlated when scanning the same anatomical structure, since they reveal the tissue appearance from different perspectives. Synthesizing images of one modality from images of another modality is therefore theoretically possible. However, the mapping between two different modalities is highly nonlinear, which makes the synthesis task difficult to accomplish.

Over the past few years, various methods have been proposed for cross-modality medical image synthesis. Typical works include coupled sparse representation [1] and deep convolutional neural networks [2,3,4]. These methods usually require paired training data, i.e., well-aligned source and target modalities from the same subject. However, perfectly paired data are not always available, which strongly limits the application of cross-modality synthesis. Moreover, misalignment within the paired source/target data is sometimes inevitable (even if small), and it can introduce ambiguity or even severely degrade current synthesis methods.

Unsupervised synthesis, which requires only unpaired training data, has been explored in [5]. That method uses cross-modality nearest-neighbor search to produce a candidate for each target voxel, and simultaneously maximizes the global mutual information between the candidate and source images; local spatial consistency is enforced to generate the final target image. Its performance is highly dependent on the accuracy of the nearest-neighbor search.

Recently, unsupervised deep learning models have been applied to image synthesis. Cycle-GAN [6], for example, has been used to synthesize CT from MR [7]. However, simply borrowing the Cycle-GAN model is insufficient, as many properties of medical images are ignored. We argue that synthesizing medical images is quite different from synthesizing natural images, owing to the 3D nature of many medical imaging modalities. Thus, in this work, we train the deep network in a quasi-3D way and design a 3D structural dissimilarity loss for several popular medical tasks. In particular, inspired by the structural similarity metric (SSIM), we introduce a new structural dissimilarity loss to improve the boundary contrast of the synthesized images.

We also simplify the generator of the GAN to decrease the number of parameters, which leads to faster training and better synthesis quality. Our generator combines the advantages of Unet [8] and the deep residual network [9], and is termed Res-Unet. The simplified model can be well trained within 3 h. We conduct extensive experiments to verify the promising performance of our method; specifically, we perform brain MR-to-CT synthesis, prostate MR-to-CT synthesis and brain 3T-to-7T MR synthesis. Several examples from our datasets are shown in Fig. 1, where the differences between the paired and the unpaired data are clear. Note that in this paper we use only the unpaired data for all the experiments.

Fig. 1.

Examples of the paired (top) and unpaired (bottom) training data for three tasks: brain MR-to-CT, prostate MR-to-CT, and brain 3T-to-7T MR. In the paired data, the input images (X and Y) belong to the same subject and are registered. In the unpaired data, the input images are clearly misaligned.

2 Method

2.1 Loss Design

We accomplish cross-modality synthesis with Cycle-Consistent Adversarial Networks. Suppose we have two imaging modalities \( X \) and \( Y \); the goal of our method is to learn the mapping functions between them. We define the training samples as \( \{x_i\}_{i=1}^{N} \in X \) and \( \{y_j\}_{j=1}^{M} \in Y \). As illustrated in Fig. 2(a), there are two mapping functions in this cross-modality synthesis task, i.e., \( G: X \to Y \) and \( F: Y \to X \), both modeled by deep neural networks. In addition, two adversarial discriminators \( D_X \) and \( D_Y \) are trained, such that \( D_X \) tries to distinguish the real images \( \{x_i\} \) from the synthesized images \( \{F(y_j)\} \); similarly, \( D_Y \) tries to distinguish \( \{y_j\} \) from \( \{G(x_i)\} \). To quantify the variation of the anatomical structures between the real and the synthesized images, we also introduce a new structural dissimilarity loss. Therefore, the objective of the network shown in Fig. 2(a) contains three terms: the adversarial loss (\( \mathcal{L}_{GAN} \)), the cycle consistency loss (\( \mathcal{L}_{CYC} \)) and the structural dissimilarity loss (\( \mathcal{L}_{DSSIM} \)):

Fig. 2.

The Cycle-Consistent Adversarial Networks used for cross-modality synthesis are illustrated in (a). The two cycle mappings are shown in (b) and (c).

$$ \begin{aligned} \mathcal{L}(G, F, D_X, D_Y) & = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) \\ & \quad + \lambda \, \mathcal{L}_{CYC}(G, F) + \beta \, \mathcal{L}_{DSSIM}(G, F), \end{aligned} $$
(1)

where \( \lambda \) and \( \beta \) control the relative importance of the individual loss terms. We set \( \lambda = 10 \) and \( \beta = 1 \) in this work.
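To make the role of the weights concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of the weighted combination in Eq. (1); the individual loss terms are computed as described in the following subsections.

```python
import torch

# A minimal sketch of Eq. (1): the full objective is a weighted sum of the
# individual loss terms, with lambda = 10 and beta = 1 as used in this work.
def total_loss(loss_gan_g: torch.Tensor, loss_gan_f: torch.Tensor,
               loss_cyc: torch.Tensor, loss_dssim: torch.Tensor,
               lam: float = 10.0, beta: float = 1.0) -> torch.Tensor:
    return loss_gan_g + loss_gan_f + lam * loss_cyc + beta * loss_dssim
```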

Adversarial Loss.

Adversarial loss is applied to both mapping functions \( G \) and \( F \). For the mapping function \( G: X \to Y \) and its corresponding discriminator \( D_{Y} \), the objective function is expressed as:

$$ \mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[ \log D_Y(y) \right] + \mathbb{E}_{x \sim p_{data}(x)}\left[ \log\left(1 - D_Y(G(x))\right) \right], $$
(2)

where \( G \) aims to generate a target-modality image \( G(x) \) that appears similar to a real target image, while \( D_Y \) aims to distinguish whether its input is a synthesized image \( G(x) \) or a real image \( y \in Y \). Therefore, \( G \) tries to minimize this objective function while \( D_Y \) tries to maximize it, i.e., \( G^{*} = \arg\min_{G} \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y) \). A similar adversarial loss is applied to the mapping function \( F: Y \to X \), i.e., \( F^{*} = \arg\min_{F} \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X) \).
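For concreteness, the following is a minimal PyTorch sketch of the adversarial loss in Eq. (2), written with the binary cross-entropy form of the log terms; the discriminator is assumed to output logits, and the generator uses the usual non-saturating variant. This is an illustrative sketch, not the authors' implementation (the underlying Cycle-GAN code may use a different GAN loss, e.g., a least-squares variant).

```python
import torch
import torch.nn.functional as nnf

# Sketch of Eq. (2). D_Y outputs logits; BCE-with-logits realizes the
# log D_Y(.) and log(1 - D_Y(.)) terms. fake_y = G(x) is a synthesized image.
def discriminator_loss(D_Y, real_y, fake_y):
    pred_real = D_Y(real_y)
    pred_fake = D_Y(fake_y.detach())               # do not backprop into G here
    loss_real = nnf.binary_cross_entropy_with_logits(
        pred_real, torch.ones_like(pred_real))     # -E[log D_Y(y)]
    loss_fake = nnf.binary_cross_entropy_with_logits(
        pred_fake, torch.zeros_like(pred_fake))    # -E[log(1 - D_Y(G(x)))]
    return loss_real + loss_fake

def generator_adv_loss(D_Y, fake_y):
    # Non-saturating generator objective: maximize log D_Y(G(x)).
    pred_fake = D_Y(fake_y)
    return nnf.binary_cross_entropy_with_logits(
        pred_fake, torch.ones_like(pred_fake))
```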

Cycle Consistency Loss.

To further reduce the ambiguity in learning the mapping functions, we enforce a cycle-consistency constraint: the difference between an input image and its cyclically synthesized counterpart should be minimized. The cycle consistency is illustrated in Fig. 2(b) and (c) for both synthesis directions, i.e., \( x \to G(x) \to F(G(x)) \) should be similar to \( x \), and \( y \to F(y) \to G(F(y)) \) should be similar to \( y \). The cycle-consistency loss is thus defined as:

$$ \mathcal{L}_{CYC}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[ \left\| F(G(x)) - x \right\|_1 \right] + \mathbb{E}_{y \sim p_{data}(y)}\left[ \left\| G(F(y)) - y \right\|_1 \right]. $$
(3)
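A direct PyTorch sketch of Eq. (3) is given below, assuming \( G \) and \( F \) are the two generator networks and x, y are image batches from the two modalities; it is an illustration rather than the authors' code.

```python
import torch

# Sketch of the cycle-consistency loss of Eq. (3): L1 distance between each
# input and its cyclically reconstructed image, averaged over all voxels.
def cycle_loss(G, F, x, y):
    loss_x = torch.mean(torch.abs(F(G(x)) - x))    # ||F(G(x)) - x||_1
    loss_y = torch.mean(torch.abs(G(F(y)) - y))    # ||G(F(y)) - y||_1
    return loss_x + loss_y
```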

Structural Dissimilarity Loss.

The global L1 loss operates on the entire image and thus ignores many local structural details. Structural information is usually critical in medical images, as it is closely related to delineating the boundaries of tissues and organs. To further improve the quality of the synthesized images regarding anatomical details, we propose to take advantage of SSIM to preserve local structures in the synthesized image. This leads to the new structural dissimilarity loss (DSSIM), a distance metric derived from SSIM:

$$ \begin{aligned} \mathcal{L}_{DSSIM}(G, F) & = \mathbb{E}_{x \sim p_{data}(x)}\left[ \frac{1 - \mathrm{SSIM}\left(x, F(G(x))\right)}{2} \right] \\ & \quad + \mathbb{E}_{y \sim p_{data}(y)}\left[ \frac{1 - \mathrm{SSIM}\left(y, G(F(y))\right)}{2} \right]. \end{aligned} $$
(4)
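Below is an illustrative PyTorch sketch of Eq. (4). The ssim() helper is a simple single-scale SSIM with a uniform 11 × 11 window and the usual constants for intensities in [0, 1]; the paper does not specify which SSIM implementation is used, so this should be read as one possible choice.

```python
import torch
import torch.nn.functional as nnf

# Single-scale SSIM with a uniform window; c1, c2 use the common constants
# (0.01 L)^2 and (0.03 L)^2 with dynamic range L = 1.
def ssim(a, b, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    pad = window // 2
    mu_a = nnf.avg_pool2d(a, window, stride=1, padding=pad)
    mu_b = nnf.avg_pool2d(b, window, stride=1, padding=pad)
    var_a = nnf.avg_pool2d(a * a, window, stride=1, padding=pad) - mu_a ** 2
    var_b = nnf.avg_pool2d(b * b, window, stride=1, padding=pad) - mu_b ** 2
    cov = nnf.avg_pool2d(a * b, window, stride=1, padding=pad) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return (num / den).mean()

# Sketch of Eq. (4): DSSIM between each input and its cyclic reconstruction.
def dssim_loss(G, F, x, y):
    return (1 - ssim(x, F(G(x)))) / 2 + (1 - ssim(y, G(F(y)))) / 2
```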

2.2 Architecture of the Generator/Discriminator

The Cycle-Consistent Adversarial Networks contain two types of networks, i.e., generators and discriminators. The generator, which is critical to the quality of the synthesized images, has many layers with abundant parameters, making the training process slow. To design a more efficient network, we take advantage of two popular architectures, i.e., Unet [8] and the deep residual network [9].

Unet is widely used in medical image analysis and has shown promising results on various tasks, including image segmentation and image synthesis. It consists of an encoding path and a decoding path, with a skip connection at each corresponding level. This design gives the network a large receptive field to capture both local and global image appearances, so that salient high-level features can be extracted, which is essential for a cross-modality mapping trained with unpaired samples. The deep residual network is also adopted in many tasks, such as image classification and image super-resolution. Its most important component is the residual block, which consists of two convolutional layers with an identity mapping, as shown in Fig. 3. The residual block is designed to alleviate the gradient vanishing issue and, at the same time, boost information exchange across layers. Inspired by these two networks, we design a deep network called Res-Unet, which fuses the advantages of Unet and the residual block; its architecture is illustrated in Fig. 3. There are 2 pooling stages, 2 deconvolution stages and 5 residual blocks in our generator.
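To make this description concrete, the following is a rough PyTorch sketch of a residual block and a Res-Unet with 2 pooling stages, 5 residual blocks and 2 deconvolution stages with skip connections. The channel widths, normalization layers and the placement of the residual blocks at the bottleneck are assumptions made for illustration; the paper specifies only the number of stages and blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity (skip) mapping."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)                    # identity mapping

class ResUnet(nn.Module):
    """Sketch: 2 pooling stages, 5 residual blocks, 2 deconvolution stages.
    Assumes H and W are divisible by 4 and a 3-slice (3-channel) input."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU(True))
        self.pool = nn.MaxPool2d(2)
        self.res = nn.Sequential(*[ResidualBlock(base * 2) for _ in range(5)])
        self.up1 = nn.ConvTranspose2d(base * 2, base * 2, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 4, base, 3, padding=1), nn.ReLU(True))
        self.up2 = nn.ConvTranspose2d(base, base, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(True))
        self.out = nn.Conv2d(base, out_ch, 1)
    def forward(self, x):
        e1 = self.enc1(x)                          # full resolution
        e2 = self.enc2(self.pool(e1))              # 1/2 resolution
        b = self.res(self.pool(e2))                # 1/4 resolution, 5 residual blocks
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))   # skip connection
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d2))         # outputs in [0, 1]
```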

Fig. 3.

Illustration of the proposed Res-Unet architecture as the generator.

3 Experimental Results

3.1 Datasets and Tasks

We use three real datasets to evaluate our cross-modality synthesis method. The datasets and the corresponding synthesis tasks are introduced below.

  (1) Brain MR-to-CT dataset. This dataset consists of 16 subjects, each of whom comes with an MR and a CT scan. The voxel sizes of the CT and MR images are \( 0.59 \times 0.59 \times 0.59\,\,{\text{mm}}^{3} \) and \( 1.2 \times 1.2 \times 1\,\,{\text{mm}}^{3} \), respectively. We separate the 16 subjects into a training set containing 10 subjects and a testing set containing 6 subjects.

  (2) Prostate MR-to-CT dataset. The prostate dataset consists of 22 subjects. The voxel sizes of the CT and MR images are \( 1.17 \times 1.17 \times 1\,\,{\text{mm}}^{3} \) and \( 1 \times 1 \times 1\,\,{\text{mm}}^{3} \), respectively. We also separate the 22 subjects into two parts: a training set containing 14 subjects and a testing set containing 8 subjects.

  (3) Brain 3T-to-7T dataset. This dataset consists of 15 subjects. The voxel sizes of the 3T and 7T MR images are \( 1 \times 1 \times 1\,\,{\text{mm}}^{3} \) and \( 0.65 \times 0.65 \times 0.65\,\,{\text{mm}}^{3} \), respectively. These 15 subjects are separated into a training set containing 10 subjects and a testing set containing 5 subjects.

For both the brain and prostate MR-to-CT tasks, the CT images are linearly aligned (by FLIRT in FSL) to the corresponding MR images and resampled to the same size as the MR images. For the brain 3T-to-7T dataset, the corresponding 3T and 7T images are also linearly aligned. The nonlinear deformations between images of the same subject are not corrected. The intensities of each image are normalized to \( [0, 1] \).
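As an example, a simple per-image min–max normalization that maps intensities to [0, 1] is sketched below; the exact normalization scheme is not specified in the paper, so this is only one plausible choice.

```python
import numpy as np

# One possible per-image intensity normalization to [0, 1] (min-max scaling);
# this is an assumption, since the paper does not state the exact scheme.
def normalize_intensity(volume: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    vmin, vmax = float(volume.min()), float(volume.max())
    return (volume - vmin) / (vmax - vmin + eps)
```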

3.2 Implementation Details

In this paper, the PyTorch implementation of the basic Cycle-GAN [6] is used in all the experiments, with the generator replaced by the proposed Res-Unet in our method. In the training phase, we extract consecutive 2D axial slices from the 3D images as training samples. Samples from the two input modalities are drawn separately, so that the samples in each pair are completely independent; this sampling process yields an unpaired training set and requires no alignment of the training images in practice. Horizontal flipping is used to augment the training data. We apply Adam optimization with a momentum of 0.9 and train for 100 epochs. The batch size is set to 1 and the initial learning rate to 0.0002. To quantitatively evaluate the results, we use the commonly accepted metrics of peak signal-to-noise ratio (PSNR), normalized mean squared error (NMSE) and structural similarity (SSIM). In general, higher PSNR, lower NMSE and higher SSIM indicate better perceptual quality of the synthesis result.
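For reference, PSNR and NMSE can be computed as sketched below; SSIM can be computed with the same function sketched in Sect. 2.1. PSNR assumes the peak intensity is 1 (after the [0, 1] normalization), and the NMSE variant here normalizes the squared error by the energy of the reference image, which is one common convention; the paper does not give the exact formulas, so these are illustrative.

```python
import numpy as np

# PSNR with peak value 1, matching the [0, 1] intensity normalization.
def psnr(ref: np.ndarray, syn: np.ndarray) -> float:
    mse = float(np.mean((ref - syn) ** 2))
    return 10.0 * np.log10(1.0 / mse)

# NMSE normalized by the energy of the reference image (one common convention).
def nmse(ref: np.ndarray, syn: np.ndarray) -> float:
    return float(np.sum((ref - syn) ** 2) / np.sum(ref ** 2))
```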

3.3 Quantitative and Visual Comparisons

In this section, we compare the cross-modality synthesis results of our proposed method with the previously reported Cycle-GAN model. Comparisons are conducted on all three tasks: brain MR-to-CT, prostate MR-to-CT, and brain 3T-to-7T. First, we show the effectiveness of the proposed generator, Res-Unet. Comparisons between ‘Basic Cycle-GAN’ and ‘Res-Unet’ are summarized in Table 1. With the new generator, the synthesis results are improved on all three datasets, which demonstrates the superiority of the proposed generator. Moreover, with the new DSSIM loss added, the synthesis performance is further improved. Overall, the quantitative results in Table 1 show that our proposed method (‘Res-Unet + DSSIM’) achieves the best results on all three tasks in terms of all evaluation metrics, i.e., PSNR, NMSE and SSIM.

Table 1. Comparisons of the synthesis results by different methods.

To give an intuitive view, visualizations of the synthesized results using ‘Basic Cycle-GAN’ and the proposed ‘Res-Unet + DSSIM’ are presented in Figs. 4, 5 and 6 for prostate MR-to-CT, brain MR-to-CT, and brain 3T-to-7T, respectively. Compared with ‘Basic Cycle-GAN’, ‘Res-Unet + DSSIM’ obtains better synthesis results with clearer tissue/organ boundaries. For example, in the coronal view of the prostate MR-to-CT task in Fig. 4, the two bones of the hip joint are successfully separated and synthesized by ‘Res-Unet + DSSIM’, while their boundaries appear blurred in the result of ‘Basic Cycle-GAN’ (orange box in the figure). We can also observe that the anatomical details indicated by the red and green boxes are clearer in ‘Res-Unet + DSSIM’. Similar observations hold for the brain MR-to-CT and brain 3T-to-7T datasets in Figs. 5 and 6.

Fig. 4.

Visual comparison for the prostate MR-to-CT synthesis task. (Color figure online)

Fig. 5.

Visual comparison for the brain MR-to-CT synthesis task. (Color figure online)

Fig. 6.

Visual comparison for the brain 3T-to-7T synthesis task. (Color figure online)

Meanwhile, note that our training is based on 3 consecutive axial slices, yet the synthesized results are consistent across all three views. That is, our method can handle the synthesis of 3D medical images well, even though only 3 consecutive slices are used in training. In the testing stage, we process every window of 3 consecutive axial slices, and the final 3D volume is obtained by averaging the overlapping synthesis results.
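The sliding-window inference described above can be sketched as follows, assuming the generator takes 3 consecutive axial slices as a 3-channel input and outputs 3 synthesized slices; the variable names and the exact averaging scheme are illustrative rather than taken from the authors' code.

```python
import numpy as np
import torch

# Quasi-3D inference: apply the generator to every window of 3 consecutive
# axial slices and average the overlapping predictions into a 3D volume.
@torch.no_grad()
def synthesize_volume(netG, volume: np.ndarray) -> np.ndarray:
    depth = volume.shape[0]                               # (depth, H, W) in [0, 1]
    acc = np.zeros_like(volume, dtype=np.float32)
    cnt = np.zeros(depth, dtype=np.float32)
    for z in range(depth - 2):
        window = torch.from_numpy(volume[z:z + 3]).float().unsqueeze(0)  # (1, 3, H, W)
        pred = netG(window).squeeze(0).cpu().numpy()                     # (3, H, W)
        acc[z:z + 3] += pred
        cnt[z:z + 3] += 1.0
    return acc / cnt[:, None, None]
```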

We conduct another experiment on the brain MR-to-CT dataset to show the fast convergence of our ‘Res-Unet + DSSIM’ compared to ‘Basic Cycle-GAN’. We show the synthesis results obtained after training for the same number of epochs in Fig. 7. With the same number of training epochs, our proposed model achieves better results. It takes 3 h to train our model, while the Basic Cycle-GAN takes 17 h. Moreover, our model contains 13.3 M parameters, about 1/4 of those of ‘Basic Cycle-GAN’. In other words, our proposed model trains faster with fewer parameters, yet achieves the best synthesis results. The testing time is 6 s for a 3D image of size \( 181 \times 234 \times 149 \).
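Parameter counts such as those quoted above can be reproduced for any PyTorch model with a small helper like the one below; the exact figure naturally depends on the chosen channel widths.

```python
import torch.nn as nn

# Count trainable parameters of a generator, e.g., to verify model size.
def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```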

Fig. 7.

Visual comparison for Basic Cycle-GAN and proposed method during training.

4 Conclusion

We proposed a novel Res-Unet architecture as the generator and solved cross-modality image synthesis with a cycle-consistent GAN. In particular, we accomplished the synthesis task in three different scenarios by training with unpaired data, which indicates that our method has great potential for many real clinical applications. The Res-Unet generator, combined with the novel loss design, has shown superior performance in mapping between image modalities with large appearance variation. In future work, we will conduct large-scale evaluations in clinical applications and demonstrate that the proposed image synthesis technique can serve as a new tool to reshape multi-modal image fusion and subsequent analysis.