1 Introduction

Gliomas are the most frequent primary brain tumors in adults [5], and the accurate segmentation of glioma and its sub-regions is crucial in clinical diagnosis, treatment planning, and post-operative evaluation. However, as shown in Fig. 1, the multiclass segmentation of multimodal brain MR images is very challenging. The major obstacles include the great variance in tumor size, shape, and location, as well as the extreme class imbalance.

Recently, deep convolutional neural networks (CNNs) have achieved remarkable performance in automatic brain tumor segmentation. Specifically, Pereira et al. [6] trained a 2D CNN on patches with data augmentation. A 3D CNN with a multi-scale, multi-stream architecture was trained on patches extracted by nonuniform sampling [1] and followed by a fully connected conditional random field (CRF) to refine the segmentation output [2]. Based on the fully convolutional network (FCN) [4], Shen et al. [7] introduced a boundary-aware network for multi-task learning on 2D image slices. Zhao et al. [12] integrated FCNs and CRFs, and trained on both patches and slices in multiple stages; additionally, three models were trained on images of the axial, coronal, and sagittal views respectively and combined by a voting-based fusion strategy.

Fig. 1. Different modalities and the ground truth of an HG tumor. Left to right: FLAIR, T1, T1c, T2, and expert manually segmented labels: necrosis (red), edema (yellow), non-enhancing tumor (blue), and enhancing tumor (green). (Color figure online)

To sum up, all these methods except [7] operate at the patch level and balance the data by controlling the sampling rate [1, 2, 6, 12]. Without prior knowledge of the test images, it is hard to extract test patches with the same sampling ratio. Moreover, end-to-end (image to segmentation map) FCN frameworks such as [7] are more computationally efficient compared with patch-based methods, but cannot handle the imbalance through nonuniform sampling or data augmentation.

To address the challenges above, we propose the Focal Dice Loss (FDL), inspired by [3], and apply image dilation. To tackle the extreme class imbalance on image slices, our FDL down-weights the well-segmented classes during training. Instead of taking all classes into consideration as the focal loss [3] does, the FDL emphasizes the imbalance among foreground classes. Meanwhile, dilation is applied to the ground truth of training samples, which allows the network to learn the complex details of the tumor structure in a coarse-to-fine manner. This differs from dilated convolution [11], which enlarges the receptive fields of convolutional kernels.

Our major contributions are as follows: (1) we propose the Focal Dice Loss to address the class imbalance in multimodal brain tumor segmentation and validate it on a publicly available dataset; (2) to the best of our knowledge, we are the first to apply image dilation to the ground truth labels during training with a gradually downsized structuring element, which yields a better high-level understanding; (3) we show that the proposed method achieves state-of-the-art average Dice Coefficient with high computational efficiency.

2 Methodology

We employ the elegant U-Net, which takes the full image context into account. As shown in Fig. 2, each block includes three convolutional layers of size \(3\,\times \,3\), and each layer is followed by ReLU activation and batch normalization. Max-pooling and up-sampling of size \(2\times 2\) are adopted in the two paths, and feature maps from the contracting path are concatenated to those in the expanding path.
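A minimal sketch of one such stage, written with the Keras functional API, is given below; the filter counts and the way the stages are wired together are our assumptions, only the block layout (three \(3\times 3\) convolutions with ReLU and batch normalization, \(2\times 2\) pooling/up-sampling, skip concatenation) follows the description above.

```python
# Sketch of one U-Net stage as described in the text (filter counts are assumptions).
from tensorflow.keras import layers

def conv_block(x, filters):
    """Three 3x3 convolutions, each followed by ReLU activation and batch normalization."""
    for _ in range(3):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Activation("relu")(x)
        x = layers.BatchNormalization()(x)
    return x

def down_step(x, filters):
    skip = conv_block(x, filters)                  # features kept for the skip connection
    return skip, layers.MaxPooling2D(2)(skip)      # 2x2 max-pooling on the contracting path

def up_step(x, skip, filters):
    x = layers.UpSampling2D(2)(x)                  # 2x2 up-sampling on the expanding path
    x = layers.Concatenate()([skip, x])            # concatenate contracting-path feature maps
    return conv_block(x, filters)
```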

2.1 Focal Dice Loss for Highly Unbalanced Data

The focal loss [3], based on the standard cross entropy, was introduced to address the data imbalance in dense object detection. It is worth noticing that for brain tumors, the class imbalance exists not only between tumor and background, but also among different sub-regions of the tumor (e.g., necrosis and edema in Fig. 1 and Table 1). Sudre et al. [10] state that with an increasing level of data imbalance, loss functions based on overlap measurements are more robust than the weighted cross entropy. Our experiments in the next section also support this argument. Therefore, the Dice Coefficient is adopted to focus on the tumor sub-regions.

Fig. 2. Network architecture: U-Net.

Balanced Dice Loss. The Dice Coefficient (DICE), also called the overlap index, is a commonly used metric in validating medical image segmentation. For the binary ground truth images of each class, DICE can be written as:

$$\begin{aligned} DICE_{t} = \frac{2 \sum _{i=1}^{N} \,p_{it}g_{it} + \epsilon }{ \sum _{i=1}^{N}\, p_{it} + \sum _{i=1}^{N}\, g_{it} +\epsilon }. \end{aligned}$$
(1)

In the above, \(g_{it} \in \{0,1\}\) specifies the ground truth label of class t at pixel i, and N indicates the total number of pixels in the image. Similarly, \(p_{it} \in [0,1]\) denotes the output probability. In practice, the \(\epsilon \) term guarantees the stability of the loss function by avoiding the numerical issue of division by zero.

A common method to handle class imbalance is to introduce a weight \(w_{t} \geqslant 0\) for each class t. Therefore, we write the Dice Loss (DL) as:

$$\begin{aligned} DL = \sum _{t}\,w_{t}\,(\,1-DICE_{t}\,). \end{aligned}$$
(2)
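A direct translation of Eqs. (1) and (2) might look as follows; the one-hot (batch, H, W, classes) tensor layout, the value of \(\epsilon \), and the use of TensorFlow ops are our assumptions, only the formulas themselves come from the text above.

```python
# Sketch of the per-class soft Dice (Eq. 1) and the weighted Dice Loss (Eq. 2).
import tensorflow as tf

def soft_dice(p, g, eps=1e-5):
    """p: predicted probabilities, g: binary ground truth, both flattened for one class t."""
    return (2.0 * tf.reduce_sum(p * g) + eps) / (tf.reduce_sum(p) + tf.reduce_sum(g) + eps)

def dice_loss(y_true, y_pred, weights):
    """y_true, y_pred: one-hot ground truth and softmax output, shape (batch, H, W, classes)."""
    loss = 0.0
    for t, w in enumerate(weights):                # one non-negative weight w_t per class
        loss += w * (1.0 - soft_dice(tf.reshape(y_pred[..., t], [-1]),
                                     tf.reshape(y_true[..., t], [-1])))
    return loss
```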
Table 1. Average class frequencies. Average frequencies taken over the training set of HG images, approximate values. Classes are: background (0), necrosis (1), edema (2), non-enhancing tumor (3), and enhancing tumor (4).

Focal Dice Loss. As mentioned in [3], the extreme class imbalance overwhelms the cross entropy loss during training. We propose to assign lower weights to the well-segmented classes and focus on the hard classes with lower DICE.

Formally, a factor \(1/ \beta \) is applied as the power of \(DICE_{t}\) for each class, where the exponent parameter \(\beta \geqslant 1\). We define the Focal Dice Loss (FDL) as:

$$\begin{aligned} FDL = \sum _{t}\,w_{t}\,(\,1-DICE_{t}^{ 1/ \beta }\,). \end{aligned}$$
(3)
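Building on the soft_dice sketch above, the FDL of Eq. (3) can be written as below; restricting the sum to the foreground channels reflects the paper's emphasis on foreground classes, while the uniform default weights are an assumption.

```python
# Sketch of the Focal Dice Loss (Eq. 3); with beta = 1 it reduces to the Dice Loss.
def focal_dice_loss(y_true, y_pred, weights=(1.0, 1.0, 1.0, 1.0), beta=2.0):
    loss = 0.0
    for t, w in enumerate(weights, start=1):       # channels 1..4: the foreground tumor classes
        d = soft_dice(tf.reshape(y_pred[..., t], [-1]),
                      tf.reshape(y_true[..., t], [-1]))
        loss += w * (1.0 - d ** (1.0 / beta))      # down-weights classes with large DICE_t
    return loss
```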
Fig. 3. Visualization of the Focal Dice Loss. A factor \(1/ \beta \) is applied as the power of \(DICE_{t}\); as \(\beta \) increases, the well-segmented classes are down-weighted.

The FDL has three notable properties. (1) If a pixel is misclassified for a class t with a large \(DICE_{t}\) (i.e., the class is well segmented), the FDL is basically unaffected. On the contrary, if \(DICE_{t}\) is small (i.e., the class is poorly segmented) and a pixel is misclassified, the FDL increases significantly. (2) The exponent parameter \(\beta \) smoothly adjusts the rate at which better-segmented classes are down-weighted. The FDL is equal to the DL when \(\beta = 1\); as the exponent \(\beta \) increases, the network focuses more on the poorly segmented classes. (3) Different from the focal loss [3], the overlap-based FDL focuses on the objects of interest instead of the entire image, which meets the demands of brain tumor segmentation.

The FDL is visualized for several values of \(\beta \in [1,4]\) in Fig. 3 (we found \(\beta =2\) to work best in our experiments). We validated the FDL on the BRATS 2015 dataset, where it shows an obvious improvement, especially for the small classes.

2.2 Dilation for Coarse-to-Fine Learning

Dilation. Dilation is one of the basic operators of mathematical morphology. Its effect on binary or grayscale images is to enlarge the boundaries of foreground pixels using a structuring element. Mathematically, the dilation of A by B, denoted \(A \oplus B \), is defined in terms of set operations:

$$\begin{aligned} A \oplus B = \{\, z \,| \, ( \hat{B})_{z} \cap A \ne \varnothing \}, \end{aligned}$$
(4)

where \(\varnothing \) is the empty set, B is the structuring element, \(\hat{B}\) is the reflection of the set B, and \((B)_{z}\) is the translation of B by the point \(z = ( z_{1}, z_{2})\).

In image processing, one application of dilation is bridging the gaps between disconnected but close components, such as broken characters. Similarly, we apply dilation to the ground truth to expand the objects and link the disconnected parts. We aim at higher-level feature extraction and therefore compromise on some low-level details in the early training stage.

Dilation on the Ground Truth. In our proposed method, dilation is applied to the binary ground truth images of each foreground class in the training set with a probability \(\alpha \). Figure 4(f) to (j) show that the structuring element for dilation shrinks gradually during training, resulting in a coarse-to-fine learning process. Note that eventually no dilation is applied (dilation by the structuring element in Fig. 4(j) leaves images unchanged). No dilation is applied to validation or test images in any of the experiments.
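A sketch of this stochastic label dilation using SciPy is given below; the probability \(\alpha \) and the shrink-to-identity schedule follow the description above, while the concrete structuring-element shapes are only illustrative.

```python
# Stochastic, coarse-to-fine dilation of the ground truth labels (sketch).
import numpy as np
from scipy.ndimage import binary_dilation

def dilate_labels(class_masks, struct, alpha=0.6, rng=np.random):
    """class_masks: binary ground-truth masks, one per foreground class."""
    if struct is None or rng.rand() >= alpha:      # with probability 1 - alpha, keep labels as-is
        return class_masks
    return [binary_dilation(m, structure=struct) for m in class_masks]

# Illustrative schedule: larger elements early in training, no dilation at the end.
schedule = [np.ones((5, 5), bool), np.ones((3, 3), bool), None]
```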

After dilation, it is possible that the dilated ground truths of different classes overlap, and pixels in the overlapping region classified into any of the intersecting classes will decrease the loss function. Under this circumstance, the FDL is able to focus on the classes with lower DICE.

Fig. 4. Dilation on tumor sub-regions. (a) to (e): the dilated necrosis and non-enhancing tumor produced by structuring elements (f) to (j). The region in blue is the ground truth, and the region in yellow and blue is the dilated ground truth. (Color figure online)

In practice, the dilation has the following properties. (1) It expands the tiny regions and connects the close but separated pieces (Fig. 4(a) to (e)); therefore, the ground truth of each foreground class shrinks from the dilated coarse labels to the original fine labels, which also helps the network focus on higher-level features. (2) Similar to Dropout, which randomly discards units together with their connections [9], the stochastic dilation of training labels prevents overfitting because of the dynamic changes during training. (3) The coarse-to-fine scheme also boosts the learning speed as well as the training efficiency.

3 Evaluation

Our method has been evaluated on the BRATS 2015 dataset. We use the HG training set, which contains MR images from 220 patients; for each patient, there are four modalities (T1, T1-contrast (T1c), T2, and FLAIR) together with the ground truth. The labels contain five classes: background, necrosis, edema, non-enhancing tumor, and enhancing tumor. The evaluation is performed on three tumor sub-compartments: (1) the complete tumor (all four tumor sub-regions); (2) the tumor core (all tumor sub-regions except edema); (3) the enhancing tumor structure (only the enhancing tumor sub-region).
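Using the label ids of Table 1, the three evaluation regions can be composed from a label map as sketched below; `seg` is only a toy stand-in for a predicted or ground truth label array.

```python
# Composing the three evaluation regions from a label map (ids as in Table 1).
import numpy as np

seg = np.array([[0, 2, 2], [1, 3, 4]])  # toy label map for illustration
complete = np.isin(seg, [1, 2, 3, 4])   # complete tumor: all four tumor sub-regions
core = np.isin(seg, [1, 3, 4])          # tumor core: all sub-regions except edema (2)
enhancing = (seg == 4)                  # enhancing tumor structure only
```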

Table 2. Performance on the 44 BRATS 2015 test images.
Fig. 5. Example results. Left to right: (a) FLAIR, (b) FLAIR with ground truth, (c) results of our method, (d) U-Net results, (e) Boundary-aware [7] results. Best viewed in color: necrosis (red), edema (yellow), non-enhancing tumor (blue), and enhancing tumor (green). (Color figure online)

In our experiments, the 220 HG images are randomly split into three sets with a ratio of 6:2:2, giving 132 training, 44 validation, and 44 test images. For all MR images, voxel intensities are normalized based on the mean and variance of the training set. We use 2D axial slices from the MR volumes as input, and each slice is cropped to 192\(\,\times \,\)200. In addition, the symmetric intensity difference map [8] of each slice is also fed into the network, resulting in 8 input channels. We use the exponent factor \(\beta = 2\) and dilation ratio \(\alpha = 0.6\). Each structuring element in Fig. 4(g) to (j) is applied for 15 epochs; the element in Fig. 4(f) is not used in our experiments. The model is implemented in Keras with the TensorFlow backend and trained for 60 epochs using the Adam optimizer with a learning rate of \(8\times 10^{-5}\).
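A sketch of the normalization step is given below: the statistics are computed on the training set only and reused for validation and test images. Computing one mean and standard deviation per input channel (last axis) is our assumption, and `train_x` / `test_x` are toy stand-ins for the real slice arrays.

```python
# Normalize voxel intensities with statistics estimated on the training set (sketch).
import numpy as np

def normalize(x, mean=None, std=None, eps=1e-8):
    if mean is None:                                    # fit statistics on the training set
        axes = tuple(range(x.ndim - 1))
        mean, std = x.mean(axis=axes), x.std(axis=axes)
    return (x - mean) / (std + eps), mean, std

train_x = np.random.rand(4, 192, 200, 8).astype(np.float32)  # toy stand-in for training slices
test_x = np.random.rand(2, 192, 200, 8).astype(np.float32)   # toy stand-in for test slices
train_norm, mu, sigma = normalize(train_x)                   # training slices
test_norm, _, _ = normalize(test_x, mu, sigma)               # reuse the training statistics
```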

The evaluation results on the 44 test images for the three tumor sub-regions are shown in Table 2. The hyper-parameters of the models listed in Table 2 are identical to those of the proposed method. Built on the U-Net, the FDL and image dilation each show improvements, especially on rather small regions such as the tumor core and the enhancing tumor, which demonstrates the capability of the FDL to improve the accuracy of classes with lower Dice. Our proposed method, which combines the FDL and dilation, outperforms the other methods in the average Dice of the three tumor regions. Example results are shown in Fig. 5. Our method achieves a better high-level understanding instead of being misled by complex details. [7] generates a smooth boundary for the entire tumor but not for each tumor sub-region, and our method also outperforms it on some disconnected components.

Besides the improvement in accuracy, another advantage of our method is the low computational cost for new test images. Recent methods reported 8 min [6], 2 to 4 min [1], and 2 min [12], respectively, for the prediction of each 3D volume on a modern GPU. Our method takes only around 3 s on an NVIDIA Titan X Pascal, including image normalization and the computation of the symmetric difference maps.

3.1 Results on the Focal Dice Loss

We tested the performance of the proposed method with different values of \(\beta \) in the FDL, as shown in Table 3. We plot the Dice curves of the 44 validation images during training. Note that the Dice in Figs. 6 and 7 is the average DICE of the 4 foreground classes, which differs from our evaluation metric of 3 regions.

Table 3. Results on different values of exponent factor.
Table 4. Results on different dilation ratios.
Fig. 6. Dice curves for different values of the exponent factor.

3.2 Results on Dilation

We also conducted experiments to explore the properties of dilation on the ground truth. Table 4 shows that our model works best when \(\alpha = 0.6\). It is worth noticing that the stability of the network degrades when the dilation ratio is 0.45 or 1 in Fig. 7. If the ground truth is dilated with a small ratio (\(\alpha = 0.45\)), the corresponding input images may be treated as noise during training, since the occurrence of dilated images is limited. For a large dilation ratio such as \(\alpha =1\), the network is likely to experience great changes when the structuring element is switched to a smaller one, which results in the oscillation of the Dice curves.

Fig. 7. Dice curves for different dilation ratios.

4 Conclusion

We introduced the FDL to address the data imbalance in multimodal brain tumor segmentation; unlike the focal loss, it focuses on the objects of interest instead of the entire image. The experiments show the capability of the FDL to improve the classes with lower accuracy. Dilation is also applied to the training samples with gradually downsized structuring elements to enlarge and connect the tiny regions for better high-level feature extraction, yielding a coarse-to-fine, incremental training approach that leaves the network structure unaffected. The performance of our method has been tested on the BRATS 2015 dataset and achieves the state of the art in Dice Coefficient with relatively low computational cost.