
1 Introduction

Positron emission tomography (PET) is a widely used nuclear imaging technique that provides crucial information for early disease diagnosis and treatment [1, 2]. In clinical practice, a standard-dose radioactive tracer is usually required to acquire PET images of diagnostic quality. However, the inherent tracer radiation inevitably increases the risk of cancer and poses potential health hazards [3], while reducing the dose introduces more noise and artifacts during imaging, resulting in degraded image quality. Accordingly, obtaining diagnostic-quality PET images at low dose is of great research significance.

In the past decades, numerous machine learning based methods have been developed to obtain high-quality standard-dose PET (SPET) images from low-dose PET (LPET) images [4,5,6]. Although these traditional algorithms have achieved promising results, they all rely on manually extracted features and are limited in capturing complex latent information. Recently, deep learning methods represented by the convolutional neural network (CNN) and the generative adversarial network (GAN) have been widely applied in medical image analysis and achieved remarkable success [7,8,9,10,11,12,13,14,15,16,17]. For instance, Wang et al. [12] first proposed a 3D GAN model for PET image synthesis, and they further [15] put forward 3D locality-adaptive GANs with an auto-context strategy for SPET prediction. Lei et al. [16] proposed a cycle-consistent GAN framework for whole-body PET estimation. These existing approaches typically translate LPET to SPET directly and only consider pixel-level differences between the real and synthesized SPET images. However, since the LPET and SPET images come from the same subjects, this simple paradigm ignores the semantic content and structure information shared between the LPET and SPET domains, which may lead to distortion of the image content during translation [18]. In light of this, how to preserve the information shared between the LPET and SPET domains to boost PET synthesis performance is a key issue addressed in this paper.

On the other hand, the synthesized PET images are expected to provide crucial clinical information for the diagnosis of cognitive impairment. Yet, current methods focus mainly on improving image quality and do not take into account the use of the synthetic images in downstream analytical and diagnostic tasks. Therefore, how to effectively improve the clinical applicability of synthetic PET images in diagnosis is another key issue addressed in this paper.

In this paper, motivated by the aforementioned issues in current PET synthesis methods, we propose a novel end-to-end classification-aided bidirectional contrastive GAN (BiC-GAN for short) framework for synthesizing high-quality SPET images from corresponding LPET images. Specifically, the proposed model mainly consists of two similar GAN-based networks, i.e., a master network and an auxiliary network, each performing both an inter-domain synthesis task and an intra-domain reconstruction task to extract the content and structure information shared between the LPET and SPET domains. Moreover, a domain alignment module is employed to maximize the shared information extracted from the two domains. Then, considering that contrastive learning (CL) has shown superior performance in learning robust image representations [19,20,21,22,23], we also introduce a CL strategy into the encoding stage, enabling more domain-independent content information to be extracted. Additionally, to enhance the clinical applicability of the synthesized SPET images, the proposed model incorporates a mild cognitive impairment (MCI) classification task into PET image synthesis, so that the classification results can be fed back to the image synthesis task to improve the quality of the synthesized images for the target classification task.

2 Methodology

The architecture of the proposed BiC-GAN is illustrated in Fig. 1. It mainly consists of a master network and an auxiliary network, which receive LPET and SPET as input, respectively, so as to fully exploit the information shared between the LPET and SPET domains through intra-domain reconstruction and inter-domain synthesis tasks. Both networks are equipped with a contrastive learning module to enhance their feature extraction capability and explore more domain-independent content information. Moreover, we design a domain alignment module to align the features extracted from the LPET and SPET domains, thus maximizing the shared information from the two domains. In addition, a discriminator network is introduced to ensure distribution consistency between the real and synthesized images. Finally, in the master network, we further incorporate an MCI classifier that distinguishes whether the synthetic images come from normal control (NC) subjects or subjects diagnosed with MCI, allowing the model to synthesize PET images of high diagnostic quality. The details of each component are described below.

Fig. 1. Overall architecture of the proposed framework.

2.1 Master Network

Generator.

The generator \(G_{M}\) of the master network takes the LPET image \(l\) as input and outputs the reconstructed LPET image \(l_{rec}\) and the synthesized SPET image \(s_{syn}\) through a shared encoder \(LEncoder\) and two task-specific decoders, i.e., an intra-domain reconstruction decoder \(LRecDec\) and an inter-domain synthesis decoder \(SSynDec\), so that the shared encoder can fully exploit the information shared between the LPET and SPET domains. Concretely, the encoder contains seven down-sampling blocks structured as 3 × 3 Convolution-BatchNorm-LeakyReLU-MaxPool, except for the last block, which removes the MaxPool layer. Through the shared encoder, the spatial size of the input image \(l\) is reduced from 128 × 128 to 2 × 2, while the channel dimension is increased to 512. Both decoders have the same structure, each containing seven up-sampling blocks with a Deconvolution-Convolution-BatchNorm-ReLU structure that gradually restore the features extracted by the shared encoder to the target images. A 2 × 2 deconvolution with stride 2 is applied as the up-sampling operator. Following [24], we only use skip connections in the synthesis task and drop them in the reconstruction task to prevent the intra-domain reconstruction decoder from copying features directly from the encoder instead of learning the mapping between them. The L1 loss is adopted to ensure that \(l_{rec}\) and \(s_{syn}\) are close to their corresponding ground truths, i.e., \(l\) and \(s\), and to encourage less blurring, as formulated in Eq. (1):

$$L_{1} \left( {G_{M} } \right) = \alpha_{1} \left\| {s - s_{syn} } \right\|_{1} + \alpha_{2} \left\| {l - l_{rec} } \right\|_{1}$$
(1)

where \({\alpha }_{1}\) and \({\alpha }_{2}\) are hyperparameters to balance synthesis and reconstruction losses.
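For illustration, a minimal PyTorch sketch of one encoder down-sampling block, one decoder up-sampling block, and the loss of Eq. (1) is given below; the class and helper names, the channel arguments, and the 'mean' reduction of the L1 terms are our assumptions and not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """One encoder down-sampling block: 3x3 Conv -> BatchNorm -> LeakyReLU -> MaxPool.
    The last encoder block omits the MaxPool layer (pool=False)."""
    def __init__(self, in_ch, out_ch, pool=True):
        super().__init__()
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        if pool:
            layers.append(nn.MaxPool2d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """One decoder up-sampling block: 2x2 stride-2 Deconv -> Conv -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def generator_l1_loss(s, s_syn, l, l_rec, alpha1=300.0, alpha2=30.0):
    """Eq. (1): weighted L1 terms for inter-domain synthesis and intra-domain
    reconstruction (defaults follow the master-network setting reported in Sect. 2.4)."""
    return alpha1 * F.l1_loss(s_syn, s) + alpha2 * F.l1_loss(l_rec, l)
```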

Discriminator.

The discriminator \(D_{M}\) of the master network receives a pair of images, i.e., the input LPET \(l\) and the corresponding real/synthesized SPET \(s\)/\(s_{syn}\), and aims to distinguish the synthesized image pair from the real one. Specifically, the discriminator is designed with reference to pix2pix [25], with the structure Conv-LeakyReLU-Conv-BatchNorm-LeakyReLU-Conv-Sigmoid. To encourage the distribution of the synthesized images to be consistent with that of the real images and thus fool the discriminator, we calculate an adversarial loss as follows:

$$L_{GAN} \left( {G_{M} ,D_{M} } \right) = E_{l,s} \left[ {\left( {D_{M} \left( {l,s} \right) - 1} \right)^{2} } \right] + E_{l} \left[ {D_{M} \left( {l,s_{syn} } \right)^{2} } \right]$$
(2)
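A minimal sketch of the least-squares adversarial loss of Eq. (2) is given below; concatenating the image pair along the channel dimension before feeding it to \(D_{M}\), and the helper names, are our assumptions.

```python
import torch

def d_adversarial_loss(d_m, l, s, s_syn):
    """Least-squares adversarial loss of Eq. (2): real pairs (l, s) are pushed
    towards 1 and synthesized pairs (l, s_syn) towards 0. The pair is assumed to
    be concatenated along the channel dimension before entering the discriminator."""
    real_pred = d_m(torch.cat([l, s], dim=1))
    fake_pred = d_m(torch.cat([l, s_syn.detach()], dim=1))
    return ((real_pred - 1) ** 2).mean() + (fake_pred ** 2).mean()

def g_adversarial_loss(d_m, l, s_syn):
    """Generator side: the synthesized pair should be judged as real (target 1)."""
    fake_pred = d_m(torch.cat([l, s_syn], dim=1))
    return ((fake_pred - 1) ** 2).mean()
```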

Contrastive Learning Module.

As illustrated in Fig. 1, a contrastive learning module is introduced to enhance the representation ability of the shared encoder. The core idea of contrastive learning is to pull positive samples towards anchor samples while pushing negative samples away in the embedding space. To achieve this, reasonable anchor, positive, and negative samples must first be constructed. Specifically, taking the master network as an example, we first obtain a local feature of size 512 × 2 × 2 from the sixth down-sampling block of the shared encoder, and then randomly select one spatial vector of this local feature as the anchor sample \(f_{anchor}^{M} \in R^{512 \times 1 \times 1}\). Meanwhile, the local feature is further processed by a 3 × 3 convolution and a max-pooling layer to produce a global feature. Since the global feature comes from the same image as the local feature, it is regarded as the positive sample \(f_{pos}^{M} \in R^{512 \times 1 \times 1}\). The global features extracted from the other images in a batch are treated as negative features {\(f_{neg}^{M} \}\) with the same size as the anchor sample. The contrastive loss is then calculated as follows:

$$L_{CL} \left( {f_{anchor}^{M} ,f_{pos}^{M} ,f_{neg}^{M} } \right) = - log\frac{{exp\left( {f_{anchor}^{M} \cdot f_{pos}^{M} } \right)}}{{exp\left( {f_{anchor}^{M} \cdot f_{pos}^{M} } \right) + \mathop \sum \nolimits_{1}^{B - 1} exp\left( {f_{anchor}^{M} \cdot f_{neg}^{M} } \right)}}$$
(3)

where “\(\cdot\)” denotes the dot product and \(B\) is the batch size. With contrastive learning, the shared encoder is enhanced to extract more domain-independent content information.
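The following sketch illustrates how the loss of Eq. (3) can be implemented in PyTorch; the way the random spatial vector is drawn and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(local_feat, global_feats):
    """InfoNCE-style loss of Eq. (3).

    local_feat:   (B, 512, 2, 2) local features from the sixth encoder block
    global_feats: (B, 512)       global features of the same batch (conv + max-pool)
    For each image, the anchor is one randomly chosen spatial vector of its local
    feature, the positive is the global feature of the same image, and the
    negatives are the global features of the other B-1 images in the batch."""
    b, c, h, w = local_feat.shape
    flat = local_feat.flatten(2)                                   # (B, C, H*W)
    idx = torch.randint(0, h * w, (b,), device=local_feat.device)
    rows = torch.arange(b, device=local_feat.device)
    anchors = flat[rows, :, idx]                                   # (B, C): one spatial vector per image
    logits = anchors @ global_feats.t()                            # (B, B) dot products f_anchor_i . f_global_j
    labels = rows                                                  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)                         # -log softmax over positive + B-1 negatives
```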

MCI Classification.

Taking the clinical reliability of the synthesized images into account, we further incorporate an MCI classifier (Cls) into the master network and use the feedback from the classification results to improve the diagnostic quality of the synthesized images. The classifier is designed as a binary classification CNN. Specifically, the synthesized SPET image \(s_{syn}\) with a size of 128 × 128 is first passed through three convolutional blocks with a kernel size of 3 × 3, stride of 1, and padding of 1, which halve the feature map size and increase the number of channels. Subsequently, the obtained feature maps are fed into three linear layers, followed by a sigmoid function, to classify whether the synthesized image comes from an MCI patient; the closer the classification result is to 0, the more likely the image is to come from a patient with MCI. The BCE loss is adopted as the classification loss, where \(c_{{s_{syn} }}\) and \(c_{S}\) denote the predicted label of \(s_{syn}\) and the ground-truth label of \(s\), respectively:

$$L_{Classify} \left( {c_{{s_{syn} }} ,c_{S} } \right) = - \left[ {c_{S} *logc_{{s_{syn} }} + \left( {1 - c_{S} } \right)*log\left( {1 - c_{{s_{syn} }} } \right)} \right]$$
(4)
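A possible PyTorch sketch of such a classifier is shown below; the channel widths, hidden sizes, and the use of max-pooling to halve the feature maps are our assumptions, since these details are not specified above.

```python
import torch
import torch.nn as nn

class MCIClassifier(nn.Module):
    """Binary NC/MCI classifier on the synthesized 128 x 128 SPET slice:
    three 3x3 conv blocks (stride 1, padding 1), each followed here by 2x2
    max-pooling to halve the feature maps, then three linear layers and a sigmoid."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64]                      # channel widths are illustrative
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(cin, cout, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),                     # 128 -> 64 -> 32 -> 16
            ]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.head(self.features(x))

classification_loss = nn.BCELoss()                   # Eq. (4)
```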

2.2 Auxiliary Network

Unlike the master network, which extracts shared information from the LPET domain, the auxiliary network is intended to extract the shared information from the SPET domain. By constraining the consistency of the shared information extracted by the two networks, the inter-domain shared information can be maximized, thus helping the master network further improve its synthesis performance. The structure of the auxiliary network is identical to that of the master network, and we constrain it with the same losses as the master network, calculated as follows:

$$L_{Auxiliary} = L_{GAN} \left( {G_{A} ,D_{A} } \right) + {\beta }_{1} L_{1} \left( {G_{A} } \right) + {\beta }_{2} L_{CL} \left( {f_{anchor}^{A} ,f_{pos}^{A} ,f_{neg}^{A} } \right)$$
(5)

where \(f_{anchor}^{A}\), \(f_{pos}^{A}\) and \(f_{neg}^{A}\) are the anchor, positive, and negative features in the auxiliary network, and \({\beta }_{1}\) and \({\beta }_{2}\) are hyperparameters to balance the loss terms. \(L_{GAN} \left( {G_{A} ,D_{A} } \right)\), \(L_{1} \left( {G_{A} } \right)\) and \(L_{CL} \left( {f_{anchor}^{A} ,f_{pos}^{A} ,f_{neg}^{A} } \right)\) are calculated in the same manner as the corresponding losses of the master network detailed above.

2.3 Domain Alignment Module

To maximize the shared information between the LPET and SPET domains, we design a domain alignment (DA) module to align the features \(F_{L}\) and \(F_{S}\) extracted by the master and auxiliary networks from the LPET and SPET domains, respectively. Specifically, we introduce a feature discriminator \(D_{F}\) that takes \(F_{L}\) and \(F_{S}\) as input and determines whether the input feature is extracted from the SPET or the LPET domain. Through adversarial learning, the features from the two domains are encouraged to be consistent, thus maximizing the inter-domain shared information. The architecture of \(D_{F}\) is Convolution-LeakyReLU-Convolution-BatchNorm-LeakyReLU-Convolution, with a sigmoid activation function producing the final output. Moreover, considering that the JS divergence measures the discrepancy between two distributions, we further employ it to narrow the gap between the distributions of \(F_{L}\) and \(F_{S}\) and thereby achieve domain alignment. The JS divergence and the domain alignment adversarial loss are defined as:

$$JS\left( {F_{L} ,F_{S} } \right) = \frac{1}{2}\sum F_{L} log\frac{{F_{L} + \varepsilon }}{{\frac{{F_{L} + F_{S} }}{2} + \varepsilon }} + \frac{1}{2}\sum F_{S} log\frac{{F_{S} + \varepsilon }}{{\frac{{F_{S} + F_{L} }}{2} + \varepsilon }}$$
(6)
$$L_{GAN} \left( {F_{L} ,F_{S} } \right) = E_{{F_{L} }} \left[ {\left( {D_{F} \left( {F_{L} } \right) - 1} \right)^{2} } \right] + E_{{F_{S} }} \left[ {D_{F} \left( {F_{S} } \right)^{2} } \right]$$
(7)

The positive constant \(\varepsilon\) in Eq. (6) is introduced to avoid a zero denominator and is set to 1e-8 in our experiments. Based on all the above, the overall loss function of the master network is given below:

$$\begin{aligned} L_{Master} = & \,{\lambda }_{1} L_{GAN} \left( {G_{M} ,D_{M} } \right) + L_{1} \left( {G_{M} } \right) + {\lambda }_{2} L_{CL} \left( {f_{anchor}^{M} ,f_{pos}^{M} ,f_{neg}^{M} } \right) \\ + & \,{\lambda }_{3} L_{Classify} \left( {c_{{s_{syn} }} ,c_{S} } \right) + {\lambda }_{4} JS\left( {F_{L} ,F_{S} } \right) + {\lambda }_{5} L_{GAN} \left( {F_{L} ,F_{S} } \right) \\ \end{aligned}$$
(8)
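For illustration, Eqs. (6) and (7) can be sketched in PyTorch as follows; applying the JS term directly to the raw features and the helper names are our assumptions.

```python
import torch

def js_divergence(f_l, f_s, eps=1e-8):
    """JS-style divergence of Eq. (6) between F_L and F_S; eps avoids log(0) and
    a zero denominator. The features are used as-is, following the equation."""
    m = 0.5 * (f_l + f_s)
    term_l = f_l * torch.log((f_l + eps) / (m + eps))
    term_s = f_s * torch.log((f_s + eps) / (m + eps))
    return 0.5 * term_l.sum() + 0.5 * term_s.sum()

def feature_adversarial_loss(d_f, f_l, f_s):
    """Eq. (7): least-squares loss for the feature discriminator D_F, pushing
    LPET-domain features towards 1 and SPET-domain features towards 0."""
    return ((d_f(f_l) - 1) ** 2).mean() + (d_f(f_s) ** 2).mean()
```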

2.4 Implementation Details

The proposed BiC-GAN model is trained alternately, as in the original GAN. Specifically, we first fix \(G_{M}\), \(G_{A}\) and the classifier \(C\) to train \(D_{M}\), \(D_{A}\) and \(D_{F}\), and then fix \(D_{M}\), \(D_{A}\) and \(D_{F}\) to train \(G_{M}\), \(G_{A}\) and \(C\). All experiments are conducted with the PyTorch framework on an NVIDIA GeForce GTX 1080Ti GPU with 11 GB memory. The whole training process lasts for 300 epochs, using the Adam optimizer with a batch size of 4. In the first 100 epochs, the learning rates of both the generator and discriminator networks are fixed at 0.0002, and they then linearly decay to 0 over the next 200 epochs. The classifier learning rate is fixed at 0.0002 for all 300 epochs.
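A schematic sketch of one alternating training iteration is given below; the generator output signature, the dictionary keys, and the placeholder loss helpers are illustrative assumptions rather than the original code.

```python
import torch

def train_step(batch, nets, opts, losses):
    """One alternating update. nets: {"G_M", "G_A", "C", "D_M", "D_A", "D_F"};
    opts: {"d": optimizer of all discriminators, "g": optimizer of G_M, G_A, C};
    losses: placeholder callables assembling Eq. (5) and Eq. (8)."""
    l, s, label = batch                              # LPET slice, SPET slice, NC/MCI label

    # Step 1: fix G_M, G_A, C and update D_M, D_A, D_F.
    with torch.no_grad():
        l_rec, s_syn, f_l = nets["G_M"](l)           # assumed output signature
        s_rec, l_syn, f_s = nets["G_A"](s)
    d_loss = (losses["d_master"](nets["D_M"], l, s, s_syn)
              + losses["d_aux"](nets["D_A"], s, l, l_syn)
              + losses["d_feat"](nets["D_F"], f_l, f_s))
    opts["d"].zero_grad()
    d_loss.backward()
    opts["d"].step()

    # Step 2: fix D_M, D_A, D_F and update G_M, G_A, C.
    l_rec, s_syn, f_l = nets["G_M"](l)
    s_rec, l_syn, f_s = nets["G_A"](s)
    g_loss = (losses["master"](l, s, l_rec, s_syn, f_l, f_s, label, nets)   # Eq. (8)
              + losses["auxiliary"](s, l, s_rec, l_syn, nets))              # Eq. (5)
    opts["g"].zero_grad()
    g_loss.backward()
    opts["g"].step()
    return d_loss.item(), g_loss.item()
```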

Based on our parameter selection studies, \(\alpha_{1}\) and \(\alpha_{2}\) in Eq. (1) are set to 300 and 30 to boost the performance of the synthesis task, while their ratio is set to 1:1 in the auxiliary network for better maximization of the shared information. To balance the loss terms, \(\beta_{1}\) and \(\beta_{2}\) are set to 100 and 1, and \({\lambda }_{1}\), \({\lambda }_{2}\), \({\lambda }_{3}\), \({\lambda }_{4}\) and \({\lambda }_{5}\) are set to 3, 3, 0.1, 0.01, and 1, respectively. At the test stage, only \(G_{M}\) is required to synthesize the SPET image.

3 Experiments and Results

We evaluate the proposed method on a Real Human Brain dataset, which contains paired LPET and SPET images collected from 16 subjects, including 8 normal control (NC) subjects and 8 mild cognitive impairment (MCI) subjects. To alleviate the over-fitting problem caused by the limited number of samples, we split each 3D scan of size 128 × 128 × 128 into 128 2D slices of size 128 × 128 and select the 60 slices whose pixels are not all black as samples, thus extending the number of samples from 16 to 960. For quantitative comparison, three standard metrics are employed to assess performance: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and normalized root-mean-square error (NRMSE). Note that, following AR-GAN [28], we compute these metrics on the entire 3D image, including both zero and non-zero pixels.
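As an illustration of this evaluation protocol, PSNR and NRMSE over a whole 3D volume can be computed as sketched below; the data-range and normalization conventions are our assumptions, since the exact choices of AR-GAN [28] are not stated here.

```python
import numpy as np

def psnr_3d(gt, pred, data_range=None):
    """PSNR over an entire 3D volume, counting zero and non-zero voxels alike."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    if data_range is None:
        data_range = gt.max() - gt.min()             # assumed data-range convention
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10((data_range ** 2) / mse)

def nrmse_3d(gt, pred):
    """NRMSE over the whole volume, normalized by the ground-truth energy
    (an assumed convention; normalization by the intensity range is also common)."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    return rmse / np.sqrt(np.mean(gt ** 2))
```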

3.1 Ablation Studies

To verify the contributions of the key components in the proposed BiC-GAN model, we ablate them and recombine them on top of a baseline GAN. Concretely, the experimental settings include: (1) the GAN network (i.e., LEncoder + SSynDec + \( D_{M}\)); (2) BiC-GAN without contrastive learning and the classifier (denoted as BiC-GAN w/o CL&Cls); (3) BiC-GAN without contrastive learning (denoted as BiC-GAN w/o CL); and (4) the proposed BiC-GAN model. The qualitative results are presented in Fig. 2, from which we can clearly see that the images synthesized by the proposed method (4) are more analogous to the ground truth and preserve more content details than those of the other variants for both NC and MCI subjects, especially in the regions indicated by the red boxes. The quantitative results are given in Table 1; the proposed method progressively boosts SPET image synthesis performance as each key component is incorporated. Specifically, compared with the GAN model, our proposed method boosts the PSNR by 1.438dB for NC subjects and by 1.957dB for MCI subjects. Furthermore, we also calculate the classification accuracy using LPET images and synthetic SPET images separately. On the classification model pre-trained with target images, the classification accuracy using only LPET images is 76.7%, while our synthetic SPET images achieve a higher accuracy of 86.7%.

Fig. 2. Qualitative comparison of the proposed method with three variant models.

Table 1. Quantitative comparison of the proposed method with three variant models.

3.2 Comparison with Existing State-of-the-Art Methods

To demonstrate the superiority of our proposed BiC-GAN method, we compare it with four state-of-the-art image synthesis methods, including Stack GAN [12], GDL-GAN [26], Ea-GAN [27], and AR-GAN [28]. The quantitative results are reported in Table 2. As observed, the proposed method significantly outperforms the first three methods and achieves PSNR comparable to the latest PET synthesis method, AR-GAN. For NC subjects, our method improves the synthesized image quality over GDL-GAN by 0.993dB in PSNR and 0.006 in SSIM, and reduces NRMSE by 0.013. For MCI subjects, our method achieves the best overall performance, with a PSNR of 28.264dB, SSIM of 0.899, and NRMSE of 0.168. Moreover, we performed paired t-tests to check whether the improvements are statistically significant; in most cases, the p-values are smaller than 0.05. Note that, although our BiC-GAN offers only a minor improvement over AR-GAN, it is lighter in model complexity: our method contains 33M parameters, while AR-GAN has 8M more. In terms of computational cost, our method requires 4.56 GFLOPs, while AR-GAN requires 7.57 GFLOPs.

We also present the qualitative comparison results in Fig. 3, where the first and second rows display the real SPET (ground truth) and the images synthesized by the five methods for an NC subject and an MCI subject, respectively. We can see that the Ea-GAN method produces the worst synthesis results with the most blurred structures, and that GDL-GAN enriches the edge information of the generated image but its output still differs noticeably from the ground truth. The PET images synthesized by the proposed method for both NC and MCI subjects are the most analogous to the ground truth, presenting sharper textures and more details than the other results, especially in the regions marked by the red boxes. In general, both the qualitative and quantitative results demonstrate the superiority of our method over other advanced image synthesis methods.

Fig. 3. Visual comparison of the proposed method with four state-of-the-art approaches.

Table 2. Quantitative comparison of the proposed method with four state-of-the-art approaches. (*: paired t-tests are conducted between the proposed method and each comparison method.)

4 Conclusion

In this work, we presented a classification-aided bidirectional contrastive GAN for high-quality PET image synthesis from corresponding LPET images. Considering that the content and structure information shared between the LPET and SPET domains is helpful for improving synthesis performance, we designed a master network and an auxiliary network to extract the shared information from the LPET and SPET domains, respectively. A contrastive learning strategy was introduced to boost the image representation capability, thus acquiring more domain-independent information. To maximize the shared information extracted from the two domains, we further applied a domain alignment module to constrain the consistency of the shared information extracted by the master and auxiliary networks. Moreover, an MCI classification task was incorporated into the master network to further improve the clinical applicability of the synthesized PET images through direct feedback from the classification task. Extensive experiments conducted on the Real Human Brain dataset demonstrate that our method achieves state-of-the-art performance in both qualitative and quantitative terms. Considering the existence of extensive unpaired SPET images in clinics, in future work we will extend our method with semi-supervised learning for superior performance.