1 Introduction

Advances in deep learning segmentation methods have enabled faster 2D and 3D segmentation [1, 2, 3]. Compared to traditional methods, these networks use high-level deeply learned features from a receptive field. Compared to stacked 2D slice segmentation, 3D segmentation has a better chance of producing consistent and continuous object shapes. However, learning a 3D volume neural network segmentation faces two challenges. First, each voxel is classified using content from a receptive field of a certain size, but the overall shape of the object is not taken into account. Therefore, a post processing step to further refine the segmentation is usually needed. To address this problem, Kamnitsas et al. [3] used fully connected conditional random fields (CRF) to refine brain lesion segmentation in a post processing step. Lu et al. [4] used graph cut in the post processing. Level set is also often used as a post processing step to refine the segmentation output from deep learning networks [2, 5, 6]. In the deployment stage, the deep learning step takes milliseconds while the post processing step usually takes longer. Thus, integrating the post processing step into the learning of the deep learning weights can further speed up and simplify the segmentation process in the deployment stage. Tang et al. [7] proposed a deep level set method for liver CT and left ventricle MRI segmentation. They use level set to refine an initial segmentation from a network trained with limited data, and then backpropagate the loss between the refined segmentation and the deep learning output. However, their method does not have an explicit mathematical formulation of the integration. Hu et al. [8] proposed a framework that lets the network learn a level set function for objects instead of a probability map. However, their level set function is obtained by directly subtracting 0.5 from the probability map, which differs from the commonly used signed distance transform.

Second, 3D volume segmentation requires significant memory because of the huge number of weights to be learned. Constrained by the memory limit, usually small volumes, obtained either by downsampling the original image or by cropping smaller regions, are fed into a deep learning network. Milletari et al. [1] and Çiçek et al. [9] downsampled the original image before feeding it into the network and upsampled the output back. However, this downsampling loses information. Besides downsampling, Gibson et al. [2] also used batch-wise spatial dropout and Monte Carlo inference to reduce memory costs without affecting performance. Memory usage can also be reduced by using fewer kernels in each layer or fewer layers in the network. However, reducing the number of kernels reduces the number of learned latent features and increases the risk of a biased network, while reducing the number of layers shortens the network depth, resulting in a smaller receptive field that loses part of the neighborhood information.

In this paper, our contribution focuses on addressing these two challenges. For the first challenge, we propose a novel way to integrate a level set energy function into a Dice based loss. We show that the proposed loss can drive the learning of the network weights such that the segmentation output of the network has the smoothness property defined by a level set energy function. This smoothing energy is propagated back into the network to train a set of weights that outputs a smoother segmentation. For the second challenge, our strategy for processing large volumes is to integrate downsampling and upsampling into the network. We evaluate the proposed method on 48 chest CTA datasets in which 16 anatomies are manually segmented. We demonstrate the efficiency of integrating post processing into the deep learning network while reducing the processing time for a volume to milliseconds.

2 Method

We first present the framework for integrating surface smoothing into the training of a deep learning network, followed by a modified segmentation network that handles large volumes by adding very few parameters to the network.

2.1 Integrating Level Set Energy into Network Loss Function

The softmax output of a segmentation network is bounded between 0 and 1. As such, the \(n_{th}\) output can be treated as a Heaviside function \(H^n(\mathbf {x})\) of a latent surface S, and its corresponding level set embedding function \(\phi ^n(\mathbf {x})\) can be obtained using the signed distance transform. From a given \(\phi ^n(\mathbf {x})\), the corresponding Heaviside function is approximated as \(H^n(\mathbf {x})=\frac{1}{2}(1+\frac{2}{\pi }\arctan (\frac{\phi ^n(\mathbf {x})}{\epsilon }))\).
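As a concrete illustration, a minimal NumPy/SciPy sketch of these two conversions follows. The sign convention (positive inside the object) and the value of \(\epsilon \) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask: np.ndarray) -> np.ndarray:
    """Signed distance transform: phi > 0 inside the object,
    phi < 0 outside (sign convention assumed for illustration)."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return inside - outside

def smoothed_heaviside(phi: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """H(phi) = 1/2 * (1 + (2/pi) * arctan(phi / eps))."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))
```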

In the level set representation, smoothing a surface is equivalent to evolving its corresponding embedding function. Thus the level set loss used for smoothing a surface is defined as:

$$\begin{aligned} E(\phi ^n(\mathbf {x}))=\int _{\varOmega }\delta (\phi ^n(\mathbf {x}))\times |\triangledown \phi ^n(\mathbf {x})|d\mathbf {x} \end{aligned}$$
(1)

where \(\varOmega \) is the image domain and \(\mathbf {x}\) is the voxel index. \(\delta (\phi ^n(\mathbf {x}))\) is the derivative of \(H^n(\mathbf {x})\) with respect to \(\phi ^n(\mathbf {x})\), and is equal to \(\frac{\epsilon }{\pi (\epsilon ^2+\phi ^n(\mathbf {x})^2)}\).
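Continuing the sketch above, the smoothed Dirac delta and a discrete evaluation of the energy in Eq. (1) can be written as follows, with the integral discretized as a sum over voxels at unit spacing (an assumption for illustration):

```python
import numpy as np

def smoothed_delta(phi: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """delta(phi) = eps / (pi * (eps^2 + phi^2)), the derivative of
    the smoothed Heaviside with respect to phi."""
    return eps / (np.pi * (eps ** 2 + phi ** 2))

def surface_energy(phi: np.ndarray, eps: float = 1.0) -> float:
    """E(phi) = sum_x delta(phi(x)) * |grad phi(x)|  (Eq. 1, discretized)."""
    grads = np.gradient(phi)                       # one array per axis
    grad_norm = np.sqrt(sum(g ** 2 for g in grads))
    return float(np.sum(smoothed_delta(phi, eps) * grad_norm))
```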

Different types of losses, such as the cross entropy loss, the Dice based loss for binary segmentation [1], and probabilistic Dice scores [2], have been proposed to train segmentation networks. In this paper, we take the multi Dice loss, the sum of Dice scores over the different organs, as the example to integrate with the level set based surface energy.

Using \(H(\mathbf {x})\) to denote the group of \(H^n(\mathbf {x})\) for all anatomies, the overall loss to minimize can be written as:

$$\begin{aligned}&E(H(\mathbf {x}))=E_1(H(\mathbf {x}))+E_2(H(\mathbf {x})) \\&\quad =-\sum _{n=0}^NDice(H^n(\mathbf {x}),g^n(\mathbf {x}))+\sum _{n=0}^Nw_n\times \int _{\varOmega }\delta (\phi ^n(\mathbf {x}))\times |\triangledown \phi ^n(\mathbf {x})|d\mathbf {x} \nonumber \end{aligned}$$
(2)

where \(E_1\) is the multi Dice based loss and \(E_2\) is the level set based loss, defined as the overall area of the segmentation surface of the \(n_{th}\) anatomy. \(g^n(\mathbf {x})\) is the ground truth binary mask of the \(n_{th}\) anatomy, \(w_n\) is the weight used for each anatomy, and N is the number of anatomies.
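For illustration, a minimal sketch of Eq. (2) follows, reusing `signed_distance` and `surface_energy` from the sketches above. Thresholding the softmax output at 0.5 before the signed distance transform is one plausible discretization assumed here, not necessarily the authors' exact procedure.

```python
import numpy as np

def combined_loss(H: np.ndarray, g: np.ndarray, w: np.ndarray,
                  eps: float = 1.0) -> float:
    """Eq. (2): negative multi-Dice plus weighted surface energies.
    H: (N+1, D, H, W) softmax output, g: one-hot ground truth,
    w: per-anatomy smoothing weights."""
    dice = 0.0
    level_set = 0.0
    for n in range(H.shape[0]):
        inter = np.sum(H[n] * g[n])
        dice += 2.0 * inter / (np.sum(H[n] ** 2) + np.sum(g[n] ** 2))
        # Embedding function from the thresholded map (assumption).
        phi = signed_distance(H[n] > 0.5)
        level_set += w[n] * surface_energy(phi, eps)
    return -dice + level_set
```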

For back propagation, we need to compute the gradient of the loss with respect to the network prediction \(H^n(\mathbf {x})\):

$$\begin{aligned} \frac{\partial E(H(\mathbf {x}))}{\partial H^n(\mathbf {x})}=\frac{\partial E_1(H(\mathbf {x}))}{\partial H^n(\mathbf {x})}+w_n\times \frac{\partial E_2(H(\mathbf {x}))}{\partial H^n(\mathbf {x})} \end{aligned}$$
(3)

in which the first part can be calculated as:

$$\begin{aligned} \frac{\partial E_1(H(\mathbf {x}))}{\partial H_{j}^{n}(\mathbf {x})} = -2\left( \frac{g_{j}^{n}(\mathbf {x})\left( \sum _{i}^{I}H_{i}^{n}(\mathbf {x})^2+\sum _{i}^{I}g_{i}^{n}(\mathbf {x})^2\right) -2H_{j}^{n}(\mathbf {x})\sum _{i}^{I}H_{i}^{n}(\mathbf {x})g_{i}^{n}(\mathbf {x})}{\left( \sum _{i}^{I}H_{i}^{n}(\mathbf {x})^2+\sum _{i}^{I}g_{i}^{n}(\mathbf {x})^2\right) ^2}\right) \end{aligned}$$
(4)

where i and j are voxel indices and I is the number of voxels. The second term can be calculated as:

$$\begin{aligned} \frac{\partial {E_2(H(\mathbf {x}))}}{\partial {H^n(\mathbf {x})}} = \frac{\partial {E_2(H(\mathbf {x}))}}{\partial {\phi ^n(\mathbf {x})}} \times \frac{\partial {\phi ^n(\mathbf {x})}}{\partial {H^n(\mathbf {x})}} \end{aligned}$$
(5)

where \(\frac{\partial \phi ^n(\mathbf {x})}{\partial H^n(\mathbf {x})}\) is difficult to solve analytically, so we approximate it with the finite difference \( \frac{\triangle \phi ^n(\mathbf {x})}{ H^n(\phi ^n(\mathbf {x})+\triangle \phi ^n(\mathbf {x}))-H^n(\phi ^n(\mathbf {x}))}\). The gradient of \(E_2(H(\mathbf {x}))\) with respect to \(\phi ^n(\mathbf {x})\) is given as [10]:

$$\begin{aligned} \frac{\partial E_2(H(\mathbf {x}))}{\partial \phi ^n(\mathbf {x})} =\delta (\phi ^n(\mathbf {x}))\times div\left( \frac{\triangledown \phi ^n(\mathbf {x})}{|\triangledown \phi ^n(\mathbf {x})|}\right) \end{aligned}$$
(6)

\(div(\frac{\triangledown \phi ^n(\mathbf {x})}{|\triangledown \phi ^n(\mathbf {x})|})\) is the mean curvature of the surface. Equation 6 evolves \(\phi ^n(\mathbf {x})\) by the surface curvature in the direction of the surface normal, which results in a smoother surface. The sign of the curvature determines whether a point on the surface moves inward or outward along the surface normal.
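The following sketch assembles the backward pass of the level set term from Eqs. (5) and (6), reusing `smoothed_delta` and `smoothed_heaviside` from the earlier sketches. The step size `dphi` and the small stabilizing constants are illustrative choices, not values from the paper.

```python
import numpy as np

def level_set_gradient(phi: np.ndarray, eps: float = 1.0,
                       dphi: float = 0.1) -> np.ndarray:
    """dE2/dH at every voxel, via Eqs. (5) and (6)."""
    grads = np.gradient(phi)
    norm = np.sqrt(sum(g ** 2 for g in grads)) + 1e-8  # avoid division by zero
    # Mean curvature div(grad phi / |grad phi|): divergence of the
    # normalized gradient field (Eq. 6).
    curvature = sum(np.gradient(grads[ax] / norm, axis=ax)
                    for ax in range(phi.ndim))
    dE2_dphi = smoothed_delta(phi, eps) * curvature
    # Finite-difference approximation of d(phi)/d(H) from Eq. (5):
    # dphi / (H(phi + dphi) - H(phi)), with a small stabilizer.
    dH = smoothed_heaviside(phi + dphi, eps) - smoothed_heaviside(phi, eps)
    return dE2_dphi * dphi / (dH + 1e-12)
```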

2.2 Segmentation Network Architecture

Inspired by guided image filtering [11], which uses an additional image to guide the filtering of a target image, we propose the architecture in Fig. 1, which uses the raw image to guide the upsampling of the low resolution segmentation maps.

Fig. 1. The network architecture modifies VNet to integrate downsampling and upsampling procedures with additional layers with few parameters.

The raw image is first downsampled by a single-kernel downsampling convolution (this can be replaced by average pooling), and the downsampled image is fed to VNet [1]. We replace the last softmax layer in standard VNet with a PReLU layer. The VNet output is then upsampled by a deconvolution layer with the number of channels preserved, which is equal to the number of anatomies plus background. The deconvolution layer can be replaced by a bilinear resampling layer and a convolution layer. The deconvolution output is then concatenated with the raw image in the channel dimension, followed by two convolution and activation layers. The downsampling and upsampling add only 31698 weights when the number of anatomies n equals 16 (3 \(\times \) 3 \(\times \) 3 for the downsampling convolution layer, (n+1) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) 2 \(\times \) (n+1) for the deconvolution layer and the following convolution layer, as well as (n+2) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) (n+1) + (n+1) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) (n+1) for the last two convolution layers). This number reduces to 513 for a binary segmentation. Thus, most of the computation stays inside the VNet architecture, whose input size is half the original input size in each dimension. This allows the processing of a large image without adding much memory cost.
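A minimal PyTorch sketch of this wrapper follows (the paper's implementation is in Caffe). The VNet backbone is assumed to be available as a module, and the kernel sizes and strides are illustrative choices consistent with the parameter counts above, not the authors' exact layer definitions.

```python
import torch
import torch.nn as nn

class GuidedVNet(nn.Module):
    """Wrap a VNet backbone with image-guided down/upsampling."""

    def __init__(self, vnet: nn.Module, num_anatomies: int = 16):
        super().__init__()
        c = num_anatomies + 1  # anatomy channels + background
        # Single-kernel strided convolution halves each dimension
        # (the text notes average pooling also works here).
        self.down = nn.Conv3d(1, 1, kernel_size=3, stride=2, padding=1)
        self.vnet = vnet  # backbone ending in PReLU instead of softmax
        # Deconvolution restores full resolution with channels preserved,
        # followed by one convolution, as in the parameter count above.
        self.up = nn.Sequential(
            nn.ConvTranspose3d(c, c, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.PReLU(),
            nn.Conv3d(c, c, kernel_size=3, padding=1),
            nn.PReLU(),
        )
        # After concatenating the raw image as a guide channel: c+1 -> c -> c.
        self.refine = nn.Sequential(
            nn.Conv3d(c + 1, c, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv3d(c, c, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        low = self.vnet(self.down(image))         # coarse per-class maps
        high = self.up(low)                       # back to full resolution
        guided = torch.cat([high, image], dim=1)  # raw image guides refinement
        return torch.softmax(self.refine(guided), dim=1)
```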

2.3 Implementation Details

This method is implemented in Caffe and runs on one TITAN X GPU with 12 GB of memory. The proposed architecture is first trained using the multi Dice loss for 300 epochs until it converges. We then continue training using the proposed loss, which integrates the level set smoothing energy, for 15 epochs. Since the anatomies have naturally different surface curvatures, we also set different smoothing weights for different anatomies. For the vertebrae, the myocardium, and the left ventricle, we set the weight to 1e-5, while for the others we use a weight of 1e-4.
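For concreteness, a small sketch of the per-anatomy smoothing weights \(w_n\) described above; the anatomy identifiers are illustrative names, not the authors' configuration keys.

```python
# The 16 structures listed in Sect. 3.1, with illustrative identifiers.
ANATOMIES = [
    "sternum", "ascending_aorta", "descending_aorta", "aortic_arch",
    "aortic_root", "left_pulmonary_artery", "right_pulmonary_artery",
    "trunk_pulmonary_artery", "vertebrae", "left_atrium", "right_atrium",
    "left_ventricle", "right_ventricle", "myocardium",
    "superior_vena_cava", "inferior_vena_cava",
]
SMOOTH_WEIGHTS = {name: 1e-4 for name in ANATOMIES}
# Anatomies with naturally high surface curvature get a smaller weight.
for name in ("vertebrae", "myocardium", "left_ventricle"):
    SMOOTH_WEIGHTS[name] = 1e-5
```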

3 Experiments and Results

3.1 Data

We collect 48 cardiac CTA images annotated for 16 anatomical structures by one annotator. The 16 anatomies are: sternum, ascending aorta, descending aorta, aortic arch, aortic root, left pulmonary artery, right pulmonary artery, trunk pulmonary artery, vertebrae, left atrium, right atrium, left ventricle, right ventricle, left ventricular myocardium, superior vena cava, and inferior vena cava, similar to [12]. The cardiac CT studies used here were acquired on a Siemens CT scanner. All images have an isotropic voxel size of 1.5 mm.

3.2 Results

For the first stage of training, which does not include the level set integrated loss, we obtain an average Dice of 79.3% with 4-fold cross validation. After continued training with the level set based smoothing energy, we obtain a Dice of 79.5%. We note that the single expert who provided the ground truth segmentation was not asked to target smoothness, so adding the smoothing term does not necessarily improve the Dice coefficient. But we can visually observe a smoother segmentation. Figure 2 shows two superior vena cava segmentation outputs generated from two trained models, with and without the level set smoothing energy. We see that some false positives caused by image noise and lack of shape information are removed because of their high curvature.

As a qualitative way of understanding the effect of the new loss function on smoothing the structures, consider the case of the spine as illustrated in Fig. 3. The progressively smoother volumes after epochs 8 and 15 are visible. To better visualize the smoothing effect, we use a large weight (1e-4) in this example only for demonstration. When applying this method to other applications, the number of training epochs and the weight \(w_n\) should be tuned as hyperparameters.

Fig. 2. Effect of adding a level set smoothing term in the loss for smoothing surfaces: (a) trained with multi Dice loss, (b) trained with the proposed loss which integrates level set smoothing energy, (c) ground truth segmentation.

Fig. 3. Effect of adding a level set smoothing term in the loss for smoothing surfaces: (a) without smoothing, (b) smoothing effect after 8 epochs, (c) smoothing effect after 15 epochs.

3.3 Performance Comparison

We compared our results with the state of the art multi atlas based segmentation method followed by corrective learning as post processing [12]. Figure 4 shows the bar plot comparing Dice per anatomy for five different methods: the multi atlas based method, the standard VNet which takes resampled volumes, the standard VNet followed by a level set smoothing step as post processing, the proposed modified VNet architecture trained with the multi Dice loss, and the proposed modified VNet architecture trained with our proposed loss. For the standard VNet, due to the memory limit, the input was downsampled to a voxel size of 2 mm \(\times \) 2 mm \(\times \) 3.5 mm and a volume size of 128 \(\times \) 192 \(\times \) 64.

From Fig. 4, we can see that the deep learning methods are comparable to the state of the art multi atlas based segmentation. The proposed modified VNet architecture trained with the proposed loss performs best among the deep learning methods. For small anatomies such as the aortic root, left pulmonary artery, and superior vena cava, we get a larger boost in performance than for large anatomies. We summarize the comparisons in Table 1. An example of the segmented volume compared to the ground truth is shown in Fig. 5.

Fig. 4. Results compared to using the original VNet and the state of the art multi atlas based segmentation from [12].

Fig. 5. Example of segmented anatomies: (a) results from weights trained for 10 epochs using the proposed loss, (b) ground truth.

Table 1. Summary of performance of different methods. Method 1: multi atlas method followed by corrective learning. Method 2: standard VNet. Method 3: standard VNet + post processing. Method 4: modified VNet trained with multi Dice loss. Method 5: modified VNet trained with the proposed loss which integrates the level set smoothing energy.

4 Conclusion

In this paper, we propose a new loss function that integrates the level set smoothing energy into the multi Dice loss to eliminate an additional post processing step. We also propose a new strategy for designing segmentation architectures that can process large volumes while adding very few parameters. The method produces accurate and fast anatomical segmentation in CTA images. The new network boosts the Dice from 0.766 to 0.795. The proposed framework for integrating the level set with network training is general and can be extended to other types of level set energy functions.