
1 Introduction

Medical image fusion plays an important role in clinical applications. Recent work on medical image fusion mainly concentrates on the computerized tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET) and single-photon emission computed tomography (SPECT) modalities [1]. Different imaging mechanisms lead to different strengths in the resulting images. For example, CT provides high-resolution images of dense structures such as bones and implants, MRI captures soft-tissue detail with high-resolution anatomical information, while PET and SPECT reveal blood flow and metabolic changes but with low spatial resolution [2]. Therefore, how to extract the salient features of different modalities and how to choose a proper fusion strategy are the main issues in medical image fusion.

Generally speaking, multi-scale transform (MST) is the most common approach in image fusion. MST-based methods consist of decomposition, fusion and reconstruction steps. Many MST-based methods have been proposed for image fusion, such as the discrete wavelet transform (DWT) [4], the contourlet transform [3] and the curvelet transform (CVT) [5]. Du et al. [17] presented a multi-scale decomposition method based on local Laplacian filtering (LLF) with an information of interest (IOI)-based fusion strategy for medical image fusion.

Representation learning [8, 23, 24] has been widely used in image fusion in recent years. In sparse representation, Liu et al. [6] applied adaptive sparse representation to image fusion and denoising, and Yin et al. [7] proposed a novel multi-focus image fusion approach based on sparse representation with a joint dictionary. In low-rank representation (LRR), Li et al. [8] proposed a multi-focus image fusion method based on dictionary learning and LRR to achieve better performance.

With the development of deep learning, many deep-learning-based image fusion methods have been proposed. Thanks to the depth of these networks, many deep features can be exploited for fusion. In [9], the authors presented a method based on a deep convolutional neural network (CNN): the two source images are fed into the network and a score map is obtained. In [10], Li et al. used the features of the middle layers of a pretrained VGG network [11] together with an \(l_1\)-norm and a weighted-average strategy to generate several candidates of the fused detail content.

In 2017, Prabhakar et al. [21] proposed an image fusion framework consisting of feature extraction layers, a fusion layer and reconstruction layers; the feature extraction layers have a Siamese architecture and the reconstruction layers consist of three CNN layers. In 2019, Li et al. [12] proposed DenseFuse, whose encoding network combines convolutional layers and a dense block, for infrared and visible image fusion. Although this method achieves good performance, it still has a drawback: the encoding network extracts image features at only a single scale.

To solve this problem, in this paper we improve DenseFuse [12] with a multi-scale mechanism in the encoding network and apply the improved architecture to medical image fusion. We use three filters of different sizes to extract features at the end of the original DenseFuse encoder, so that we obtain feature maps at several scales, and we apply the fusion strategy to each scale separately. Finally, the fused features of the different scales are concatenated and fed into the decoder to obtain the fused image.

The rest of this paper is organized as follows. Section 2 briefly introduces the related work. Section 3 presents the improved method in detail. Section 4 shows the experimental results. Finally, Sect. 5 concludes the paper.

2 Related Works

Recently, many deep learning methods have been adopted in the field of image fusion. With the development of deeper networks, some issues have arisen, such as vanishing gradients and the growth in the number of parameters. In CVPR 2017, Huang et al. [13] presented the Dense Convolutional Network (DenseNet), in which the feature maps of each layer are used as input to all subsequent layers. This architecture has several advantages: it alleviates the vanishing-gradient problem, makes full use of the features of the middle layers and reduces the number of parameters.
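
For illustration, a minimal TensorFlow sketch of such a dense block is given below; the layer count, growth rate and activation are illustrative assumptions, not the values used in [13] or in our encoder.

```python
import tensorflow as tf

def dense_block(x, num_layers=3, growth=16):
    """Sketch of a DenseNet-style dense block [13]: every layer receives
    the concatenation of all preceding feature maps (illustrative sizes)."""
    features = [x]
    for _ in range(num_layers):
        inputs = features[0] if len(features) == 1 else tf.keras.layers.Concatenate()(features)
        y = tf.keras.layers.Conv2D(growth, 3, padding='same', activation='relu')(inputs)
        features.append(y)
    # The block output is the concatenation of the input and all layer outputs.
    return tf.keras.layers.Concatenate()(features)
```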

Building on the advantages of DenseNet, Li et al. [12] proposed in 2019 a deep learning architecture that consists of an encoding network, a fusion layer and a decoding network. The encoding network is constructed from convolutional layers and a dense block [13], which allows it to leverage more useful information from the middle layers, and a unique training strategy was developed in DenseFuse in which the decoding network reconstructs the output image from the extracted features. In the fusion phase, the two source images are fed into the encoding network, the two resulting feature maps are combined by the fusion strategy, and the fused maps are passed to the decoding network to obtain the fused image.
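
Schematically, this fusion phase can be sketched with a hypothetical helper in which the encoder, fusion strategy and decoder (the concrete networks of [12] and Sect. 3) are passed in as callables:

```python
def fuse_pair(encoder, fusion_strategy, decoder, img1, img2):
    """Sketch of the DenseFuse-style fusion phase described above."""
    features1 = encoder(img1)                     # feature maps of the first source image
    features2 = encoder(img2)                     # feature maps of the second source image
    fused = fusion_strategy([features1, features2])  # combine the two feature maps
    return decoder(fused)                         # reconstruct the fused image
```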

In this paper, we propose a multi-scale DenseNet (MSDNet), which is built on DenseFuse. We add a multi-scale mechanism to the encoder to extract features at different scales and apply the fusion strategy to each scale separately. We use the DenseFuse decoder, which consists of four CNN layers, to reconstruct the fused image. The algorithm is introduced in detail in Sect. 3.

3 Methodology

In this section, the improved method is introduced in detail. First of all, we describe how the method is applied to medical image fusion. The framework of the method is shown in Fig. 1.

Fig. 1. The framework of the proposed method.

The input medical images are denoted as \(I_1\) and \(I_2\). We adopt the method of [18] to convert color images into the YUV color space. If \(I_1\) is grayscale, we convert \(I_2\) to YUV space and use only its Y channel. The two (converted) images are then fed into MSDNet. Finally, the output of MSDNet is combined with the U and V channels of \(I_2\) and converted from YUV back to RGB to obtain the fused image \(f\).
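
The following is a minimal sketch of this color handling, assuming an OpenCV-style YUV conversion (the exact conversion used in [18] may differ); the helper fuse_y is a placeholder for the MSDNet forward pass on a pair of single-channel images.

```python
import cv2

def fuse_color_pair(i1_gray, i2_rgb, fuse_y):
    """Color handling around MSDNet (sketch, assumed OpenCV YUV conversion).

    i1_gray : grayscale source image (e.g. CT or MRI), float32 in [0, 1]
    i2_rgb  : color source image (e.g. PET or SPECT), float32 in [0, 1]
    fuse_y  : callable fusing two single-channel images (MSDNet forward pass)
    """
    # Convert the color image to YUV and keep only its Y channel for fusion.
    i2_yuv = cv2.cvtColor(i2_rgb, cv2.COLOR_RGB2YUV)
    y2, u2, v2 = cv2.split(i2_yuv)

    # Fuse the grayscale image with the luminance channel of the color image.
    fused_y = fuse_y(i1_gray, y2)

    # Recombine with the original U and V channels and convert back to RGB.
    fused_yuv = cv2.merge([fused_y, u2, v2])
    return cv2.cvtColor(fused_yuv, cv2.COLOR_YUV2RGB)
```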

Secondly, considering that only single-scale features are used in DenseFuse, our goal is to add a multi-scale mechanism to the DenseFuse encoder. The resulting framework is therefore called MSDNet. MSDNet consists of an encoder, a fusion layer and a decoder, as shown in Fig. 2.

Fig. 2. The architecture of the MSDNet.

Fig. 3. The diagram of the \(l_1\)-norm strategy.

The encoder is constructed from a convolutional layer, a dense block and a multi-scale layer. The multi-scale layer, whose filter sizes are \(5\times 5\), \(3\times 3\) and \(1\times 1\), is added at the end of the dense block to extract features from coarse to fine. We choose these sizes because the \(1\times 1\) filters fuse the information of different channels at the same location, while the \(3\times 3\) and \(5\times 5\) filters fuse the information of different channels around that location; with even larger filters the extracted features become less distinctive, so larger sizes are not selected. Through the multi-scale layer we obtain three groups of multi-channel features, which are then fused separately using the \(l_1\)-norm strategy [12] shown in Fig. 3.
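
A minimal TensorFlow sketch of such a multi-scale layer is given below; the padding, activation and the way the three branches are attached to the dense-block output are assumptions made for illustration.

```python
import tensorflow as tf

def multi_scale_layer(encoder_features, channels=64):
    """Sketch of the multi-scale layer: three parallel convolutions applied
    to the dense-block output. The channel count (64) matches N in the text;
    other hyperparameters are assumptions."""
    conv1 = tf.keras.layers.Conv2D(channels, 1, padding='same', activation='relu')
    conv3 = tf.keras.layers.Conv2D(channels, 3, padding='same', activation='relu')
    conv5 = tf.keras.layers.Conv2D(channels, 5, padding='same', activation='relu')
    # Each branch yields one group of multi-channel features (fine to coarse).
    return conv1(encoder_features), conv3(encoder_features), conv5(encoder_features)
```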

In Fig. 3, the feature maps are denoted as \(\varphi _i^{1:N}(x,y)\), where \(i \in \{1,\dots , k\}\) indexes the input images, \(n \in \{1,2,\dots ,N\}\) indexes the channels, and \(N=64\) is the number of feature maps. The maps \(\varphi _i^{1:N}(x,y)\) are processed by the \(l_1\)-norm as in Eq. 1.

$$\begin{aligned}&\alpha _i(x,y) = ||\varphi _i^{1:N}(x,y)||_1 = \sum \nolimits _{n=1}^{N} |\varphi _i^{n}(x,y)| \end{aligned}$$
(1)

Then the weight maps are obtained by normalizing the activity maps over the \(k\) input images, as in Eq. 2.

$$\begin{aligned}&w_i(x,y) = \frac{\alpha _i(x,y)}{\sum _{i=1}^{k}\alpha _i(x,y)} \end{aligned}$$
(2)

Finally, the fused feature map \(f_s^n(x,y)\) of each scale, where \(s \in \{1, 3, 5\}\) denotes the filter size of that scale, is calculated by Eq. 3.

$$\begin{aligned}&f_s^n(x,y) = \sum \nolimits _{i=1}^{k} w_i(x,y) \times \varphi _i^n(x,y) \end{aligned}$$
(3)

The three groups of fused features \(f_1\), \(f_3\) and \(f_5\) are concatenated together as input to the decoder, which finally reconstructs the fused image.
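
As a concrete illustration, a minimal NumPy sketch of the \(l_1\)-norm strategy (Eqs. 1-3) for one scale is given below; the array shapes and the small constant added to avoid division by zero are assumptions.

```python
import numpy as np

def l1_norm_fusion(feature_maps):
    """Sketch of the l1-norm strategy (Eqs. 1-3) applied at one scale.

    feature_maps : list of arrays, one per source image, each of shape (H, W, N)
    returns      : fused feature map of shape (H, W, N)
    """
    # Eq. 1: activity map = l1-norm across the N channels at each position.
    alphas = [np.sum(np.abs(f), axis=-1) for f in feature_maps]   # each (H, W)
    # Eq. 2: weight maps obtained by normalizing the activity maps.
    total = np.sum(alphas, axis=0) + 1e-12                        # avoid division by zero
    weights = [a / total for a in alphas]
    # Eq. 3: weighted sum of the source feature maps.
    return sum(w[..., None] * f for w, f in zip(weights, feature_maps))

# The same routine is applied to the 1x1, 3x3 and 5x5 feature groups, and the
# three fused groups are concatenated along the channel axis before the decoder.
```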

In the training phase, our aim is to train the network's ability to reconstruct the input image, as shown in Fig. 4. The input image is first processed by the dense convolutional layers; then, convolutional kernels of different sizes (\(1\times 1\), \(3\times 3\) and \(5\times 5\)) extract features at different scales. Finally, the multi-channel features of the different scales are concatenated and fed into the decoder. We adopt the structural similarity (SSIM) loss [12] and the pixel loss [12] to guarantee that the reconstructed image stays close to the input image, as in Eq. 4.

$$\begin{aligned} L = \lambda L_{ssim}+L_p \end{aligned}$$
(4)

In this paper, \(\lambda = 1000\); the reason is discussed in detail in Sect. 4.2.
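
The following TensorFlow sketch shows one way to implement Eq. 4 under the DenseFuse loss definitions, assuming \(L_{ssim} = 1 - SSIM(O, I)\) and a mean-squared pixel loss for \(L_p\); the exact pixel-loss norm used in [12] may differ slightly.

```python
import tensorflow as tf

def reconstruction_loss(output, target, lam=1000.0):
    """Sketch of Eq. 4, assuming L_ssim = 1 - SSIM and a mean-squared pixel loss."""
    l_ssim = 1.0 - tf.reduce_mean(tf.image.ssim(output, target, max_val=1.0))
    l_p = tf.reduce_mean(tf.square(output - target))
    return lam * l_ssim + l_p
```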

Our training images are from MS-COCO [14]. We choose 80000 images, resize them to \(256\times 256\) and convert them to grayscale to train the network. The learning rate, batch size and number of epochs are set to \(1\times 10^{-4}\), 2 and 4, respectively.

4 Experiments and Analysis

4.1 Experiment Settings

In our experiment, there are three fusion categories of medical images, which are computerized tomography (CT) and magnetic resonance imaging (MRI), MRI and positron emission tomography (PET), and MRI and single-photon emission computed tomography (SPECT) [20].

Fig. 4. The framework of the training process.

As shown in Fig. 1, in CT and MRI fusion, CT is \(I_1\) and MRI is \(I_2\); in MRI and PET fusion, MRI is \(I_1\) and PET is \(I_2\); in MRI and SPECT fusion, MRI is \(I_1\) and SPECT is \(I_2\). We fuse the three groups of medical images and analyze them from objective and subjective points of view.

We compare the proposed method with seven prior methods: a medical image fusion method based on convolutional neural networks (CNN) [2]; IHS-PCA [15], which adopts the intensity-hue-saturation (IHS) transform and principal component analysis (PCA) to preserve more spatial features and the required functional information without color distortion; LES-DC [16]; LLF-IOI [17]; medical image fusion with PA-PCNN in the nonsubsampled shearlet transform domain (NSST) [18]; infrared and visible image fusion using a deep learning framework (VGG) [10]; and DenseFuse [12].

To evaluate the proposed method against these seven existing methods, we choose six quality indicators: \(SSIM_a\); \(PSNR_a\); and \(FMI_{dct}\), \(FMI_w\), \(FMI_{edge}\) and \(FMI_{gradient}\) [19], which calculate the mutual information of the discrete cosine, wavelet, edge and gradient features, respectively.

In our experiment, the \(SSIM_a\) and \(PSNR_a\) are calculated by Eqs. 5 and 6,

$$\begin{aligned}&SSIM_a(F) = (SSIM(F,I_1)+SSIM(F,I_2))\times 0.5 \end{aligned}$$
(5)
$$\begin{aligned}&PSNR_a(F) = (PSNR(F,I_1)+PSNR(F,I_2))\times 0.5 \end{aligned}$$
(6)

where \(SSIM(\cdot )\) denotes the structural similarity operation [22], \(PSNR(\cdot )\) denotes the peak signal-to-noise ratio, F is the fused image and \(I_1\) and \(I_2\) are the source images. The values of \(SSIM_a\) and \(PSNR_a\) represent the ability to retain the structural information and the original information of the source images, respectively.
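
For reference, a minimal TensorFlow sketch of these two averaged metrics (Eqs. 5 and 6) is given below; images are assumed to be single-channel tensors scaled to [0, 1].

```python
import tensorflow as tf

def ssim_a(fused, i1, i2):
    """Average structural similarity to both source images (Eq. 5)."""
    return 0.5 * (tf.image.ssim(fused, i1, max_val=1.0) +
                  tf.image.ssim(fused, i2, max_val=1.0))

def psnr_a(fused, i1, i2):
    """Average peak signal-to-noise ratio to both source images (Eq. 6)."""
    return 0.5 * (tf.image.psnr(fused, i1, max_val=1.0) +
                  tf.image.psnr(fused, i2, max_val=1.0))
```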

Larger values of all six measures indicate better image fusion performance.

All experiments are implemented in Python with the TensorFlow framework and run on an NVIDIA GTX 1080Ti GPU.

4.2 Loss of Training Phase

In [12], \(\lambda \in \{1, 10, 100, 1000\}\). According to the experimental comparison shown in Fig. 5, we find that the model converges faster and more stably when \(\lambda =1000\). Therefore, we choose \(\lambda =1000\) in this paper.

Fig. 5. The plot of the loss L during training.

Table 1. The average values of quality metrics for 60 fused images, including CT and MRI, MRI and PET, and MRI and SPECT pairs.

4.3 Baseline

Firstly, we compare the proposed method with DenseFuse (\(\lambda =1000\)), a recently developed fusion method. In Table 1, the values are the average results for 60 fused images, including CT and MRI, MRI and PET, and MRI and SPECT pairs. The best results are shown in bold. We can see that adding the multi-scale mechanism to DenseFuse is effective.

4.4 Subjective Evaluation

The fused images obtained by the seven compared methods and the proposed method are shown in Fig. 6.

As we can see from Fig. 6, LES-DC [16] and VGG [10] retain less valid information than the other methods, as some features are not very clear. The fused images obtained by LLF-IOI [17] are slightly over-sharpened and contain some artificial noise. Compared with the existing methods, the fused images obtained by our method are more natural. We next leverage objective indicators to analyze the fusion performance.

Fig. 6. Fused results for medical (RGB) images. Rows 1 to 3 of (a) and (b) are CT and MRI images; rows 4 to 6 of (a) and (b) are MRI and PET images; rows 7 to 9 of (a) and (b) are MRI and SPECT images; (c) CNN; (d) IHS-PCA; (e) LES-DC; (f) LLF-IOI; (g) NSST with PA-PCNN; (h) VGG; (i) DenseFuse; (j) ours.

4.5 Objective Evaluation

We use \(SSIM_a\), \(PSNR_a\), \(FMI_{dct}\), \(FMI_w\), \(FMI_{edge}\) and \(FMI_{gradient}\) to analyze the fusion performance. We test three groups of medical images: CT and MRI, MRI and PET, and MRI and SPECT. The results are shown in Table 2.

In Table 2, the best results are shown in bold, the second-best results are marked in red and the third-best results are marked in blue. It can be seen that the indicators of the proposed method are the highest for the most part, which shows that the results of the proposed method contain more salient features and less artificial noise.

Table 2. The average values of quality metrics for 20 fused images in each fusion category.

5 Conclusion

In this paper, we propose a multi-scale DenseNet by adding a multi-scale mechanism into DenseFuse, and we apply the improved method to medical image fusion.

Our network consists of an encoder, a fusion layer and a decoder. The encoder is made of a convolutional layer, a dense block and a multi-scale layer; the decoder is made of four CNN layers. After the multi-scale layer, we obtain three groups of feature maps, which are fused separately with the \(l_1\)-norm strategy, concatenated, and fed into the decoder. Finally, the fused image is reconstructed by the decoder.

We use subjective evaluation and objective quality metrics to assess the performance of the fusion results. The experimental results indicate that our method is effective for medical image fusion.