Abstract
Although numerous improvements have been made in the field of image segmentation using convolutional neural networks, the majority of these improvements rely on training with larger datasets, model architecture modifications, novel loss functions, and better optimizers. In this paper, we propose a new paradigm for boosting segmentation performance that relies on optimally modifying the network's input instead of the network itself. In particular, we leverage the gradients of a trained segmentation network with respect to the input to transfer the input to a space where the segmentation accuracy improves. We test the proposed method on three publicly available medical image segmentation datasets: the ISIC 2017 Skin Lesion Segmentation dataset, the Shenzhen Chest X-Ray dataset, and the CVC-ColonDB dataset, for which our method achieves improvements of 5.8%, 0.5%, and 4.8% in the average Dice scores, respectively.
1 Introduction
Recently, there have been considerable advancements in semantic image segmentation using convolutional neural networks (CNNs), which have been applied to interpretation tasks on both natural and medical images [13]. The improvements are mostly attributed to exploring new neural architectures (with varying depths, widths, and connectivity or topology), designing new types of components or layers, adopting new loss functions, and training on larger datasets (via augmentation or acquisition). Among the first high-impact CNN-based segmentation models were the fully convolutional networks proposed by Long et al. [14] for pixel-wise labeling. Next, encoder-decoder (and similarly convolution-deconvolution) segmentation networks were introduced [1]. Soon after, Ronneberger et al. [18] showed that adding skip connections to the segmentation network improves model accuracy and addresses vanishing gradients. More recent advancements include densely connected CNN architectures [10], learnable skip connections [21], hybrid object detection-segmentation [9], and a new encoder-decoder architecture with atrous separable convolution [5].
Designing new loss functions also resulted in improvements in subsequent inference-time segmentation accuracy, e.g., optimizing various segmentation prediction metrics, such as the intersection over union [3] and the Dice score [15], controlling the level of false positives and negatives [22, 23], and adding regularizers to loss functions to encode geometrical and topological shape priors [2, 16].
Some previous works resorted to modifying the input image to improve the segmentation results. These modifications included applying conventional image normalization techniques prior to feeding the image to a segmentation network, e.g., Haematoxylin and Eosin pre-processing [7], edge-preserving smoothing [17], and whitening transformation [11]. Other works generated variants of the input image to augment the training data by applying radiometric and spatial image transformations, e.g., rotation, color shifting/normalization, and elastic deformation [19]. The shortcoming of such pre-processing methods is that they are not explicitly optimized to improve a specific task, e.g., segmentation or classification. To the best of our knowledge, no previous work has optimized the manipulation of the input image in order to improve the segmentation accuracy of a trained network. Recently, Drozdzal et al. [8] showed that attaching a pre-processing module at the beginning of a segmentation network improves the network performance. However, we argue that adding a pre-processor without any other explicit constraint(s) amounts to adding (or prefixing) more layers, i.e., essentially making the model deeper. Inspired by adversarial perturbations [12, 24], in this paper, we choose to optimally modify the input image prior to feed-forward inference. Our input transformation is carried out via a novel gradient-based method that leverages the computational processes of any trained segmentation network. After calculating these optimal transformations on training data, we then learn an image-to-image translation network that estimates an image modification mapping for novel test images. Note that our input transformation is not a data-augmentation method (albeit data augmentation may still be performed independently of our method); rather, we learn (from training data) a translation network that will pre-process novel input at inference time.
In this paper, we make the following contributions: (a) we introduce the first iterative gradient-based input pre-processing method, (b) we adopt an explicit objective to effectuate the purpose of the pre-processor, and (c) we show how targeted gradient-based adversarial perturbation methods can be leveraged for improved segmentation performance.
2 Method
Segmenting an input image \(\mathbf {I}\) of size \(n \times m\) assigns a label \(l_i \in \mathcal {C} = \{0,1,\cdots ,L-1\}\) to each pixel in \(\mathbf {I}\), where L is the number of classes. Given a segmentation network with parameters \(\mathbf {\Theta }\), let \(f(\mathbf {I; \Theta })\) denote the pixel-wise activation for \(\mathbf {I}\) before the softmax normalization (denoted by \(\xi _{\mathcal {C}}\)), and let \(\mathbf {\hat{\mathcal {S}}} \in \mathbb {R}^{n \times m \times L}\) represent the segmented image as
\(\mathbf {\hat{\mathcal {S}}} = \xi _{\mathcal {C}}\left( f(\mathbf {I; \Theta })\right) . \quad (1)\)
Let \(\mathbf {\mathcal {S}} \in \mathbb {R}^{n \times m \times L}\) denote the ground truth segmentation for \(\mathbf {I}\). For a perfect segmentation, \(\mathbf {\hat{\mathcal {S}}} = \mathbf {\mathcal {S}}\). Our goal is to introduce a perturbation \(\mathbf {\Delta }_{\mathbf {I}}\) to \(\mathbf {I}\), such that the segmentation output of the modified image \(\mathbf {I + \mathbf {\Delta }_{\mathbf {I}}}\) is equal to the ground truth, i.e.,
\(\xi _{\mathcal {C}}\left( f(\mathbf {I + \mathbf {\Delta }_{\mathbf {I}}; \Theta })\right) = \mathbf {\mathcal {S}}. \quad (2)\)
Let \(f_{\mathbf {\hat{\mathcal {S}}}}(\mathbf {I; \Theta })\) and \(f_{\mathbf {{\mathcal {S}}}}(\mathbf {I; \Theta })\) represent the pixel-wise activations corresponding to the segmentation outputs \(\hat{\mathcal {S}}\) and \({\mathcal {S}}\), respectively. We apply a gradient descent algorithm for estimating the perturbation \(\mathbf {\Delta }_{\mathbf {I}}\) to be added to \(\mathbf {I}\). The objective function \(\mathcal {G}\), whose minimization reduces the activations of the current (incorrect) predictions and increases those of the ground truth labels, can be written as
\(\mathcal {G}(\mathbf {I, \hat{\mathcal {S}}, \mathcal {S}, \Theta }) = f_{\mathbf {\hat{\mathcal {S}}}}(\mathbf {I; \Theta }) - f_{\mathbf {{\mathcal {S}}}}(\mathbf {I; \Theta }). \quad (3)\)
Starting with the original image \(\mathbf {I}\), we iteratively compute the gradient of the loss \(\mathcal {G}(\mathbf {I, \hat{\mathcal {S}}, \mathcal {S}, \Theta })\) with respect to \(\mathbf {I} + \pmb {\delta }'^{(k)}_{\mathbf {I}}\) and add it to the image. Note that \(\pmb {\delta }'^{(k)}_{\mathbf {I}}\) is zero for the first iteration. Let \(\mathbf {I}^{(k)}\) denote the perturbed image after the \(k^{th}\) iteration of gradient descent. For the \(k^{th}\) iteration, we have
\(\mathbf {I}^{(k)} = \mathbf {I}^{(k-1)} + \gamma \, \pmb {\delta }'^{(k)}_{\mathbf {I}}, \quad \mathbf {I}^{(0)} = \mathbf {I}, \quad (4)\)
where \(\gamma \) is a scaling constant, and \(\pmb {\delta }'^{(k)}_{\mathbf {I}}\) is the gradient obtained for the \(k^{th}\) iteration, calculated by the gradient descent update
\(\pmb {\delta }^{(k)}_{\mathbf {I}} = -\nabla _{\mathbf {I}^{(k-1)}}\, \mathcal {G}\left( \mathbf {I}^{(k-1)}, \mathbf {\hat{\mathcal {S}}, \mathcal {S}, \Theta }\right) , \quad (5)\)
and then normalized using its \(L_\infty \) norm for numerical stability as
\(\pmb {\delta }'^{(k)}_{\mathbf {I}} = \pmb {\delta }^{(k)}_{\mathbf {I}} \big / \left\Vert \pmb {\delta }^{(k)}_{\mathbf {I}}\right\Vert _{\infty }. \quad (6)\)
The algorithm terminates if the segmentation of the modified image is equal (or close enough, within an error margin) to the ground truth, or when it reaches a certain maximum number of iterations K. The total perturbation for image \(\mathbf {I}\) is then calculated as the sum of the individual perturbations, i.e., \(\mathbf {\Delta }_{\mathbf {I}} = \sum _k \pmb {\delta }'^{(k)}_{\mathbf {I}}\). The segmentation output of this perturbed image \(\mathbf {I + \Delta _{\mathbf {I}}}\) is denoted by \(\mathcal {S}^*\).
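To make the procedure concrete, the following is a minimal PyTorch-style sketch of the iterative perturbation loop. It is an illustration under stated assumptions rather than the authors' implementation: `seg_net` is assumed to be a trained network returning pixel-wise logits, the objective follows the form of Eq. (3), and all function and variable names are hypothetical.

```python
import torch

def perturb_image(seg_net, image, gt_mask, gamma=1.0, max_iters=100, tol=1e-3):
    """Iteratively perturb `image` so that seg_net(image + delta) approaches `gt_mask`.

    image:   (1, C, H, W) input tensor
    gt_mask: (1, H, W) integer ground-truth labels
    Returns the accumulated perturbation Delta_I (same shape as `image`).
    """
    total_delta = torch.zeros_like(image)
    perturbed = image.clone()

    for _ in range(max_iters):
        perturbed.requires_grad_(True)
        logits = seg_net(perturbed)            # (1, L, H, W) pre-softmax activations
        pred = logits.argmax(dim=1)            # current prediction \hat{S}

        # terminate once the prediction matches the ground truth closely enough
        if (pred != gt_mask).float().mean() < tol:
            break

        # objective G: decrease the logit of the currently predicted (wrong) label,
        # increase the logit of the ground-truth label (cf. Fig. 1, right)
        logit_pred = logits.gather(1, pred.unsqueeze(1))
        logit_gt = logits.gather(1, gt_mask.unsqueeze(1))
        loss = (logit_pred - logit_gt).mean()

        grad, = torch.autograd.grad(loss, perturbed)

        # gradient-descent step, normalized by its L-infinity norm for stability
        delta = -grad / (grad.abs().max() + 1e-12)
        total_delta = total_delta + gamma * delta
        perturbed = (image + total_delta).detach()

    return total_delta
```

Here the loop runs for at most `max_iters` iterations (the K above) and `gamma` scales each normalized step.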
The calculation of \(\mathbf {\Delta }_{\mathbf {I}}\) requires knowledge of the ground truth segmentation mask and thus is only available for the training data. To test the hypothesis that the proposed gradient-based method improves the segmentation performance, we perturb the test images with their corresponding ground truths, and we obtain an almost perfect segmentation (Dice score \(\sim 1.0\)). However, since the ground truth segmentation masks for test images are not available in practice, we propose to reconstruct an estimate of the perturbed test images. In particular, given pairs of training images and their corresponding perturbations, \(\left\{ \left( \mathbf {I}_{train}, \mathbf {I}_{train} + \ \mathbf {\Delta }_{\mathbf {I}_{train}}\right) \right\} ,\) we train a deep model to learn \(\mathbf {\Phi }: \mathbf {I}_{train} \rightarrow \mathbf {I}_{train} + \ \mathbf {\Delta }_{\mathbf {I}_{train}}\). Subsequently, we apply the learned \(\mathbf {\Phi }\) to the test data to obtain the reconstruction \(\mathbf {I}_{test} \rightarrow \mathbf {I}_{test} + \ \mathbf {\Delta }_{\mathbf {I}_{test}}\). To learn the transformation function \(\mathbf {\Phi }(\cdot )\), an image-to-image translation network can be used. Figure 1 shows an overview of the proposed method.
Fig. 1. Left: Passing image \(\mathbf {I}\) through the trained network N generates the sub-optimal output \(\hat{\mathcal {S}}\). The gradient perturbation module (top) calculates a perturbation \(\mathbf {\Delta }\) on \(\mathbf {I}\), such that passing \(\mathbf {I}+\mathbf {\Delta }\) through N yields results closer to the ground truth \(\mathcal {S}\). The translation network (middle) is trained to learn the mapping \(\mathbf {\Phi :I \rightarrow I+\Delta }\). Test images are first transformed via \(\mathbf {\Phi }\) before being fed into N (bottom), which results in an improved segmentation output \(\mathcal {S}^*\). Right: Reducing the logits of the background and increasing those of the foreground.
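As a rough, non-authoritative illustration of how the two stages in Fig. 1 fit together, a sketch is given below. It assumes the hypothetical `perturb_image` routine above, and that `seg_net`, `translation_net`, `train_loader`, and `test_image` already exist; any image-to-image translation architecture can play the role of \(\mathbf {\Phi }\).

```python
import torch
import torch.nn.functional as F

# Stage 1: build (I, I + Delta_I) training pairs using the gradient-based routine above.
pairs = [(img, img + perturb_image(seg_net, img, mask))
         for img, mask in train_loader]

# Stage 2: fit the translation network Phi: I -> I + Delta_I on those pairs.
optimizer = torch.optim.Adam(translation_net.parameters())
for img, target in pairs:
    optimizer.zero_grad()
    # plain L1 loss shown here; the GP_Ts variant adds an SSIM term (see Sect. 3.1)
    loss = F.l1_loss(translation_net(img), target)
    loss.backward()
    optimizer.step()

# Inference: transform a novel test image with Phi, then segment it with the trained network.
with torch.no_grad():
    seg_output = seg_net(translation_net(test_image))
```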
3 Implementation Details and Data
3.1 Models
As the goal of this work is to demonstrate the effectiveness of the proposed gradient-based perturbation, we use a state-of-the-art baseline segmentation network, i.e., the U-Net [18], and optimize it using Adadelta with a batch size of 64. To learn the transformations made by the proposed gradient-based perturbation method (GP) from the training data, as discussed in Sect. 2, we use two image-to-image translation networks: Cycle-GAN (cG) [25] and the one hundred layers Tiramisu segmentation network (T) [10]. We modify the latter into an image-to-image translation network and replace the original Tiramisu network's loss function with one consisting of two terms: a Structural Similarity Index Measure (SSIM) term and a mean absolute error (L1 loss) term:
\(\mathcal {L} = \lambda \left( 1 - \mathcal {SSIM}\left( \mathbf {I},\ \mathbf {I} + \mathbf {\Delta }\right) \right) + \left\Vert \mathbf {I} - \left( \mathbf {I} + \mathbf {\Delta }\right) \right\Vert _1, \quad (7)\)
where \(\mathcal {SSIM}\left( \mathbf {I},\ \mathbf {I} + \mathbf {\Delta }\right) \) is the SSIM calculated between \(\mathbf {I}\) and \(\mathbf {I} + \mathbf {\Delta }\), and \(\lambda \) is a scaling constant. The SSIM loss captures the finer perceptual details to which the human visual system is sensitive, such as contrast, luminance, and structure, which the L1 loss fails to capture. When reporting results, we use the following abbreviations: (i) ORIG: the original U-Net; (ii) \(\mathrm {GP_{cG}}\): the proposed GP (\(\gamma = 1.0\)) + Cycle-GAN for reconstructing the test image perturbation; (iii) \(\mathrm {GP_T}\): GP (\(\gamma =0.1\)) + Tiramisu reconstruction with L1 loss; and (iv) \(\mathrm {GP_{Ts}}\): GP (\(\gamma =1.0\)) + Tiramisu reconstruction with SSIM loss (s; Eq. 7). We choose the maximum possible value of \(\gamma =1.0\) in Eq. 4 for the \(\mathrm {GP_{Ts}}\) and \(\mathrm {GP_{cG}}\) methods to maximally perturb images with the goal of the highest possible segmentation performance, and run the optimization for \(K=100\) iterations. Because the perturbations can have negative values, we use a linear activation function for the last layer in all the aforementioned image-to-image translation networks.
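A minimal sketch of such a combined SSIM + L1 objective is given below. It is an assumption-laden illustration rather than the authors' exact loss: images are assumed to lie in [0, 1], a simplified uniform local window replaces the usual Gaussian-weighted SSIM window, and `lam` stands for \(\lambda \).

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM with a uniform local window (images assumed in [0, 1])."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()

def reconstruction_loss(pred, target, lam=1.0):
    """SSIM term (structure/contrast/luminance) + mean absolute error (L1) term."""
    return lam * (1.0 - ssim(pred, target)) + F.l1_loss(pred, target)
```

The SSIM term penalizes structural discrepancies that a per-pixel L1 term alone would miss, which is the motivation given above for \(\mathrm {GP_{Ts}}\).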
3.2 Data
We use three datasets to evaluate our method. (a) The ISIC 2017 Skin Lesion Segmentation Dataset [6], hereafter referred to as SKIN, consists of 2000 skin lesion images for training and 600 test images. For all the images, the lesion boundaries have been manually annotated by expert dermatologists, separating the lesion from the surrounding normal skin and other miscellaneous structures. (b) The Shenzhen Chest X-Ray Dataset [20], hereafter referred to as LUNG, consists of 662 frontal chest X-Ray images, out of which 336 are cases with manifestations of tuberculosis and the remaining 326 are healthy. The corresponding ground truth masks contain manually traced boundaries for the left and the right lungs. (c) The CVC-ColonDB [4], hereafter referred to as COLON, is a database of frames, along with the corresponding annotated polyp masks, extracted from colonoscopy videos. The dataset consists of 300 training images and 50 test images.
4 Results and Discussion
To test the effectiveness of the proposed transformation method, in this section, we present both quantitative and qualitative results on the three aforementioned datasets. We start with SKIN and the different derivatives of the proposed gradient-based perturbation method discussed in Sect. 3.1, i.e., \(\mathrm {GP_{cG}}\), \(\mathrm {GP_T}\), and \(\mathrm {GP_{Ts}}\), and compare them to ORIG. As shown in Fig. 2, the proposed method (i.e., \(\mathrm {GP_{Ts}}\)) produces the segmentation mask closest to the ground truth (GT) among all the methods. Looking at the second row of Fig. 2, \(\mathrm {GP_{Ts}}\) perturbs the pixel values in a band surrounding the lesion, enhancing its contrast and making it more distinguishable from the background, thereby boosting the segmentation network's result, especially around the critical lesion boundary pixels. Fig. 2 also shows qualitative results for SKIN obtained with the U-Net. All the gradient-based perturbation methods outperform ORIG, with the proposed \(\mathrm {GP_{Ts}}\) method achieving a significant improvement of 5.8% in the mean Dice score compared to ORIG.
Moreover, a visual inspection of the ORIG and the \(\mathrm {GP_{Ts}}\) results (Fig. 2, right) shows that the latter is better at avoiding false negative pixels, which is also supported by the results in Fig. 3, where \(\mathrm {GP_{Ts}}\) improves the False Negative Rate (FNR) by a considerable margin (5.87%).

Next, we pick the best-performing method from the SKIN experiments, i.e., \(\mathrm {GP_{Ts}}\), and evaluate its performance on the remaining two datasets: COLON and LUNG. Figure 3 shows the qualitative results obtained for ORIG and \(\mathrm {GP_{Ts}}\) applied to three sample images from LUNG and COLON. As can be seen from the results, \(\mathrm {GP_{Ts}}\) obtains segmentation results much closer (i.e., smoother, with no perforated or disconnected blobs) to the ground truth segmentation (GT). This is also validated by the quantitative results reported in Table 1, where \(\mathrm {GP_{Ts}}\) outperforms ORIG in all three metrics (Dice score, FPR, and FNR), obtaining 4.8% and 0.5% improvements in the mean Dice score for COLON and LUNG, respectively. To further capture the improvement in segmentation performance, we use Gaussian kernel density estimation to estimate the probability density functions of the Dice score, FPR, and FNR for the three datasets. The plots in Fig. 4 support the quantitative results, with higher peaks (which correspond to higher densities) at larger Dice values for \(\mathrm {GP_{Ts}}\) as compared to ORIG for all three datasets. Looking at the range of the three metrics, we observe that for \(\mathrm {GP_{Ts}}\) the values are generally concentrated at higher Dice scores and lower FPR and FNR than for ORIG.

Although we achieve up to \(\sim 5\%\) improvement in the Dice score, it is important to note that a much larger possible improvement is lost during the reconstruction phase. For example, for SKIN, when we perturb the test images with their corresponding ground truths as described in Sect. 2, we obtain almost perfect segmentation results, i.e., Dice \(\sim 1.0\) (we emphasize that this performance improvement is solely from the perturbation and not from the image-to-image translation components), but the best results we obtain through reconstruction are considerably lower. An SSIM of \(0.32 \pm 0.009\) between the \(\mathrm {GP}\) and the \(\mathrm {GP_{Ts}}\) images supports our hypothesis that a lot of information is lost in the reconstruction phase.

Since training our input-perturbing mechanism requires gradients from an already-trained segmentation network, along with pixel-level class labels, training our perturbation module and training the segmentation network cannot be done simultaneously. In our next experiment, we set out to demonstrate that (i) using our pre-trained perturbation network as a pre-processor to a segmentation network leads to better results than (ii) an end-to-end training of the pre-processor network serially connected to the segmentation network. In other words, we compare the segmentation performance when training the pre-processor with (i) vs. without (ii) the proposed gradient-based constraint. We find that the proposed method (i) achieves a higher Dice score of 0.8190 compared to 0.8019 for (ii).
5 Conclusion
We proposed a novel input image transformation optimized for improved segmentation. A network gradient-based method calculates signed input perturbations on training images, which are then used to train deep networks to infer test image transformations. Our evaluations showed that the proposed approach can improve the performance of baseline methods for a variety of medical imaging modalities. Future work includes improving the translation step of the proposed method and extending it to other tasks such as image classification and object detection.
References
Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561 (2015)
BenTaieb, A., Hamarneh, G.: Topology aware fully convolutional networks for histology gland segmentation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 460–468. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_53
Berman, M., Rannen Triki, A., Blaschko, M.B.: The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR, pp. 4413–4421 (2018)
Bernal, J., Sánchez, J., Vilarino, F.: Towards automatic polyp detection with a polyp appearance model. Pattern Recogn. 45(9), 3166–3182 (2012)
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_49
Codella, N.C., et al.: Skin lesion analysis towards melanoma detection: a challenge at the 2017 ISBI. In: ISBI, pp. 168–172 (2018)
Cui, Y., Zhang, G., Liu, Z., Xiong, Z., Hu, J.: A deep learning algorithm for one-step contour aware nuclei segmentation of histopathological images. arXiv:1803.02786 (2018)
Drozdzal, M., et al.: Learning normalized inputs for iterative estimation in medical image segmentation. Med. Image Anal. 44, 1–13 (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y.: The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In: CVPR Workshops, pp. 11–19 (2017)
Kannan, S., et al.: Segmentation of glomeruli within trichrome images using deep learning. Kidney Int. Rep. 4(7), 955–962 (2019)
Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. arXiv:1607.02533 (2016)
Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 3DV, pp. 565–571 (2016)
Mirikharaji, Z., Hamarneh, G.: Star shape prior in fully convolutional networks for skin lesion segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11073, pp. 737–745. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00937-3_84
Pal, C., Chakrabarti, A., Ghosh, R.: A brief survey of recent edge-preserving smoothing algorithms on digital images. arXiv:1503.07297 (2015)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Shen, X., et al.: Automatic portrait segmentation for image stylization. Comput. Graph. Forum 35, 93–102 (2016)
Stirenko, S., et al.: Chest X-ray analysis of tuberculosis by deep learning with segmentation and augmentation. In: 2018 IEEE 38th International Conference on Electronics and Nanotechnology (ELNANO), pp. 422–428 (2018)
Taghanaki, S.A., et al.: Select, attend, and transfer: light, learnable skip connections. arXiv:1804.05181 (2018)
Taghanaki, S.A., et al.: Combo loss: handling input and output imbalance in multi-organ segmentation. Comput. Med. Imaging Graph. 75, 24–33 (2019)
Wong, K.C.L., Moradi, M., Tang, H., Syeda-Mahmood, T.: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 612–619. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_70
Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1369–1378 (2017)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: CVPR, pp. 2223–2232 (2017)
Acknowledgement
Partial funding for this project is provided by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors are grateful to the NVIDIA Corporation for donating Titan X GPUs and to Compute Canada for HPC resources used in this research.