
1 Introduction

Gliomas are tumors that arise from glial cells and occur in the brain and spinal cord. They are often fatal and are classified as either high grade or low grade gliomas [19] based on their aggressiveness. While high grade gliomas (HGGs) are malignant, grow rapidly and result in high mortality, low grade gliomas (LGGs) develop more slowly, may be benign or malignant, and can progress into HGGs if the patient is left untreated [18]. Thus, it is essential that both HGGs and LGGs are treated as early as possible, since the survival rate of the patient depends on the characteristics of the tumor. Based on these characteristics, different methods of treatment, such as surgery, chemotherapy, or radiation therapy [10], are used. To diagnose the tumor, plan treatment and predict the survival rate after surgery, various modes of MRI scans are used. The MRI scans are usually segmented into tumor regions by experts. However, manual delineation of the tumor is complex and time consuming due to the variation in tumor structure. As a result, it is important to develop methods that can accurately perform automatic image segmentation in order to delineate the different regions of a brain tumor. However, the heterogeneity in the size, location, and shape of gliomas makes developing such algorithms challenging.

With the advent of deep learning, it has become possible to develop networks that achieve a moderate level of accuracy in automatic image segmentation tasks. Moreover, the introduction of convolutional neural network (CNN) architectures [16] has further improved the performance and reduced the complexity of image segmentation, because CNNs do not require hand-crafted features and use fewer parameters than DNNs require for the same task. CNNs have been used by several researchers for segmenting sections of the brain [23] and for brain tumor segmentation [14, 15, 21].

To accomplish automatic segmentation, we propose to use the DeepLabv3+ framework [8], which has been shown to successfully identify objects in natural images; DeepLabv3+ achieves a mIoU of 89.0% on the test data for the PASCAL VOC 2012 challenge [11]. We utilize the pre-trained Xception network of DeepLabv3+ and further train it on multimodal MRI scans, treating the individual modes as image channels. DeepLabv3+ uses parallel atrous convolutions at different rates to capture information at different scales and an encoder-decoder pathway to achieve sharp delineation of object boundaries.

In this paper, we present our technique for performing brain tumor segmentation as part of our participation in the MICCAI BraTS 2018 challenge. The data consists of MRI scans in four modes: FLAIR, T1CE, T1, and T2. The goal of the segmentation challenge is to segment the brain scans and identify the whole tumor, tumor core, and enhancing tumor regions. Our method involves the use of 21 models, where 18 models are trained directly on the input MRI scans and the other 3 models use the outputs from the first 18 models as their inputs. All our models are trained and validated using slices of the brain images, and we consider the 3 possible orientations in each case. Thus, our initial 18 models can be viewed as 6 sets of 3 models each. We use different combinations of inputs for each of the initial 18 models and then use bit packing to combine their results and prepare the inputs for the final 3 models. Each of the final 3 models is used to segment one region of the tumor, and the results are combined to get the final segmentation. During this final combination, we use the fact that the tumor core is a sub-region of the whole tumor and the enhancing tumor is a sub-region of the tumor core.

Our contributions are as follows:

  1. Build a pipeline to perform brain tumor segmentation and identify the whole tumor, tumor core and enhancing tumor regions.

  2. Combine several DeepLabv3+ models to achieve higher accuracy than can be achieved by using a single instance of DeepLabv3+.

  3. Design and implement a bit packing-based algorithm to combine the results of several models and prepare the inputs for the final 3 models in the pipeline.

  4. Achieve Dice coefficients of 0.7086, 0.7897 and 0.8755 for the enhancing tumor, tumor core and whole tumor regions, respectively, on the BraTS 2018 testing data.

Fig. 1. DeepLabv3+ involves the use of atrous convolution at multiple scales in the encoder network as well as a decoder network

2 Background: DeepLabv3+ Architecture

Convolutional neural networks typically apply convolution with different strides as well as pooling. These operations reduce the sizes of the feature maps that are operated on by the succeeding layers, and as a result, the last feature map lacks detailed information about object boundaries. In order to overcome this challenge, an encoder-decoder pathway is used in several architectures [1, 17, 22]. While the U-Net [22] decoder concatenates the feature maps at corresponding scales in the encoder and decoder pathways, SegNet [1] saves and reuses the max pooling indices from the encoder pathway. On the other hand, DeepLab [6] and DeepLabv3 [7] use parallel atrous convolutions at different rates in order to capture information at different scales. Effectively, this technique, called Atrous Spatial Pyramid Pooling (ASPP), allows the encoder network to use different fields of view. Figure 1 shows how DeepLabv3+ [8] combines these two ideas, using both an encoder-decoder pathway and ASPP to achieve precise delineation of object boundaries.

DeepLabv3+ utilizes ASPP but, due to computational cost and complexity, it does not extract encoder features at a resolution finer than \(\frac{1}{8}^{th}\) of the input image. The last feature map generated by the encoder network, which has 256 channels, is used as the input to the decoder network. The decoder bilinearly upsamples this feature map, concatenates it with the corresponding low-level features from the encoder (whose channels are first reduced by a \(1\times 1\) convolution), and then applies \(3 \times 3\) convolutions followed by a final upsampling.
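
To make the combination of ASPP and the decoder concrete, the following minimal TensorFlow/Keras sketch illustrates the two ideas. It is an illustration only, not the authors' implementation or the official DeepLab code: the filter counts, the toy backbone and the upsampling factors are assumptions chosen for readability.

import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    """Atrous Spatial Pyramid Pooling: parallel atrous (dilated) convolutions
    at several rates plus image-level pooling, concatenated and projected."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        branches.append(layers.Conv2D(filters, 3, dilation_rate=r,
                                      padding="same", activation="relu")(x))
    # Image-level features: pool to 1x1, project, upsample back to the map size.
    pool = layers.AveragePooling2D(pool_size=(x.shape[1], x.shape[2]))(x)
    pool = layers.Conv2D(filters, 1, activation="relu")(pool)
    pool = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                               interpolation="bilinear")(pool)
    branches.append(pool)
    merged = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(merged)

def decoder(encoder_out, low_level, num_classes=4):
    """DeepLabv3+-style decoder: upsample the encoder output, concatenate it
    with channel-reduced low-level features, refine with 3x3 convolutions."""
    low = layers.Conv2D(48, 1, padding="same", activation="relu")(low_level)
    up = layers.UpSampling2D(size=4, interpolation="bilinear")(encoder_out)
    x = layers.Concatenate()([up, low])
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(num_classes, 1, padding="same")(x)
    return layers.UpSampling2D(size=4, interpolation="bilinear")(x)

# Toy stand-in for the backbone: a real model would use Xception feature maps.
inputs = layers.Input((240, 240, 3))
low_level = layers.Conv2D(64, 3, strides=4, padding="same", activation="relu")(inputs)
encoded = layers.Conv2D(256, 3, strides=4, padding="same", activation="relu")(low_level)
outputs = decoder(aspp(encoded), low_level)
model = tf.keras.Model(inputs, outputs)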

DeepLabv3+ modifies the Xception model [9] by making the network deeper, using depth-wise separable atrous convolution instead of max pooling, and introducing additional batch normalization and non-linear activation layers. At the time of writing, the resulting architecture [12], which is implemented in TensorFlow, leads the PASCAL VOC leaderboard for image segmentation, and hence we decided to build our automatic brain tumor segmentation software on this network.

3 Brain Tumor Segmentation Using 21 Models

3.1 Data

The training dataset for the MICCAI BraTS 2018 challenge consists of 285 sets of NIfTI image files, with 210 sets for the HGG tumor type and 75 sets for LGG. Each set consists of MRI scans in four different modes: T1, T2, T1CE and FLAIR. The corresponding segmentation map is also provided for each of the 285 patients. The segmentation labels are assigned as follows:

  • Label 4 denotes the enhancing tumor region.

  • Label 2 denotes the peritumoral edema.

  • Label 1 denotes the NCR+NET (necrotic and non-enhancing tumor core) region.

Our task is to identify the enhancing tumor; the tumor core, which consists of the enhancing tumor and the NCR+NET regions; and the whole tumor, which consists of the tumor core and the edema. Details about the BraTS challenge and the annotated dataset are provided in [3, 20]. The relevant data sources are [2, 4].

3.2 Data Preprocessing

In order to segment the different regions of a brain tumor, we slice the 3-D NIfTI brain images along 3 orientations: sagittal, coronal, and axial. By doing so, we get 155 axial slices and 240 slices each in the sagittal and coronal directions. In the original image, the intensity of each voxel is stored as a 16-bit integer. To use DeepLabv3+ for segmenting the slices, we store each slice as a separate .png file, and the maximum possible intensity of a pixel in an 8-bit .png file is 255. Thus, we scale the intensities in the original NIfTI image to values between 0 and 255. Each NIfTI image file stores the intensities of \(240 \times 240 \times 155\) voxels, so each axial slice consists of \(240 \times 240\) pixels, while the coronal and sagittal slices have \(240 \times 155\) pixels each. We use zero padding to increase the size of each coronal and sagittal slice to \(240 \times 240\) pixels.
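
A minimal sketch of this slicing, rescaling and padding step is shown below. It assumes the nibabel, NumPy and Pillow packages and a hypothetical file name, and is illustrative rather than the exact preprocessing code used.

import numpy as np
import nibabel as nib
from PIL import Image

def nifti_to_slices(path, axis):
    """Load a NIfTI volume, rescale intensities to 0-255 and return a list of
    zero-padded 240x240 slices along the given axis (0 = sagittal,
    1 = coronal, 2 = axial)."""
    vol = nib.load(path).get_fdata().astype(np.float32)   # shape 240x240x155
    vol = np.clip(vol, 0, None)
    if vol.max() > 0:
        vol = vol * (255.0 / vol.max())                    # scale to 0-255
    vol = vol.astype(np.uint8)
    slices = []
    for i in range(vol.shape[axis]):
        sl = np.take(vol, i, axis=axis)                    # one 2-D slice
        pad = ((0, 240 - sl.shape[0]), (0, 240 - sl.shape[1]))
        slices.append(np.pad(sl, pad))                     # zero pad to 240x240
    return slices

# Example: write the axial slices of one (hypothetical) FLAIR volume to .png.
for idx, sl in enumerate(nifti_to_slices("case_flair.nii.gz", axis=2)):
    Image.fromarray(sl, mode="L").save("flair_axial_%03d.png" % idx)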

We use the following 6 combinations of images, in the three separate orientations, to create the inputs to the initial 18 DeepLabv3+ models. Since DeepLabv3+ accepts 3-channel inputs, we use various combinations of image slices to obtain RGB images that contain useful data (a code sketch of this construction is given after the list).

  1. To generate the inputs for the first set, called the RGB set, we combine the corresponding T1CE, T2 and FLAIR slices into RGB images, where the FLAIR image occupies channel 1 (R), T1CE occupies channel 2 (G) and T2 occupies channel 3 (B). The FLAIR images contain information that helps to identify the whole tumor, while the T2 and T1CE images help to identify the tumor core and enhancing tumor sections, respectively.

  2. For the second set, we use the FLAIR image as the first channel, the T2 image as the second channel and, for the third channel, we use slices created by taking the pixel-wise differences of the corresponding T1CE and T1 slices. We use the pixel-wise difference of the T1CE and T1 images because it contains information that identifies the enhancing tumor and the NCR regions. We refer to this as the T1CE-T1 set.

  3. The third set of images, called FLAIRs3, consists of 3 consecutive FLAIR slices. Thus, to segment slice \(n\), we use slices \(n-1\), \(n\) and \(n+1\) for the 3 channels. This combination contains information from the two neighboring slices and should increase the accuracy of the prediction.

  4. The fourth set of images, called T1CEs3, consists of 3 consecutive T1CE slices, prepared in the same way as the FLAIRs3 slices.

  5. The fifth set of images, called T2s3, consists of 3 consecutive T2 slices.

  6. The sixth set of images, called T1s3, consists of 3 consecutive T1 slices.

For the training data, we use the labels provided in the training data set, but assign label 3 to all pixels that are labeled 4. This is because DeepLabv3+ expects consecutive labels and label 3 is not used in the data provided.
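
The following sketch illustrates, for a single slice position, how the six input sets described above can be assembled from per-modality slice arrays and how label 4 is remapped to label 3. The function and array names are hypothetical.

import numpy as np

def rgb_set(flair, t1ce, t2):
    """RGB set: FLAIR -> channel 1 (R), T1CE -> channel 2 (G), T2 -> channel 3 (B)."""
    return np.stack([flair, t1ce, t2], axis=-1)

def t1ce_minus_t1_set(flair, t2, t1ce, t1):
    """T1CE-T1 set: FLAIR and T2 slices plus the pixel-wise T1CE - T1 difference."""
    diff = np.clip(t1ce.astype(np.int16) - t1.astype(np.int16), 0, 255).astype(np.uint8)
    return np.stack([flair, t2, diff], axis=-1)

def consecutive_set(modality_slices, n):
    """FLAIRs3 / T1CEs3 / T2s3 / T1s3 set: slices n-1, n and n+1 of one modality."""
    return np.stack([modality_slices[n - 1], modality_slices[n],
                     modality_slices[n + 1]], axis=-1)

def remap_labels(seg):
    """DeepLabv3+ expects consecutive labels, so label 4 is remapped to 3."""
    seg = seg.copy()
    seg[seg == 4] = 3
    return seg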

Fig. 2. Brain tumor segmentation architecture: there are 18 initial classifiers that belong to 6 sets, each containing classifiers for slices in the sagittal (S), coronal (C) and axial (A) directions. The results of these are packed to form the inputs to the final three classifiers for the 3 different target classes.

3.3 Brain Tumor Segmentation Using DeepLabv3+

As shown in Fig. 2, we use 18 initial classifiers and apply bit packing to the pixel-wise predictions from these models. The packed results are then used as the inputs to our final three models, one for each of the three target classes. Finally, a combiner merges the results of the three final models to obtain the combined prediction.

Training and Prediction Using the Initial 18 Classifiers: We use 18 initial models, which consist of 6 sets of 3 models each (one model for each of the three orientations). The input data for each of these 18 models have been described in Sect. 3.2.

During training, we only use the slices in our training set that actually contain tumor. On average, in the training data provided, about 1.07% of the brain is occupied by tumor in HGG cases and about 1.24% in LGG cases; the portion of the brain that contains tumor is therefore significantly smaller than the healthy part. Thus, to reduce this class imbalance, we only use the slices that contain tumor during the training phase. The validation and testing sets, however, consist of all the slices in a specific direction (axial, sagittal or coronal).
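
As a simple illustration, this training-time slice filtering can be expressed as follows, assuming paired lists of image and label slices; the helper name is hypothetical.

import numpy as np

def keep_tumor_slices(image_slices, label_slices):
    """Keep only the (image, label) slice pairs that contain at least one
    tumor voxel (non-zero label); applied to the training split only."""
    return [(img, lab) for img, lab in zip(image_slices, label_slices)
            if np.any(lab > 0)]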

We train our models using an existing DeepLabv3+ pre-trained model. Specifically, we use a checkpoint of the Xception model [13], which is pre-trained on the augmented Pascal VOC dataset. We use a learning rate of 0.005, a batch size of 8 (due to GPU memory constraints) and train for a total of 75000 steps. We use poly decay and a decay rate of 0.1 to decrease the learning rate after 30000 iterations.

Using the above method, we train models separately on the 6 sets of input slices in the axial, sagittal and coronal orientations. Thus, we get three separately trained models for each set and use each model to predict the label of each voxel in the corresponding validation slices. Since there are 6 sets, we get a total of 18 predictions for each voxel in the validation data after we train the initial 18 classifiers and use them to perform prediction on the validation data.

Fig. 3. A diagrammatic representation of the exact bits occupied by the predictions from the different initial classifiers, indicating the channels occupied by the sagittal (S), coronal (C) and axial (A) predictions from the same set. The most significant bits are the leftmost ones; for example, the sagittal RGB predictions occupy the MSBs of channel 1. \(FLAIRs3\_m\), \(T2s3\_m\) and \(T1s3\_m\) are generated after using our majority voting-based combiner.

Bit Packing of the Results from the Initial 18 Classifiers to Prepare the Inputs to the Final 3 Classifiers: Since the initial 18 classifiers assign a label from 0 to 3 to each pixel, only 2 bits are required to store each label. DeepLabv3+, however, takes images that use 8 bits per pixel per channel as input, so each pixel value lies between 0 and 255. Thus, for each pixel, we can use bit packing to pack the results of 4 initial classifiers into the value of one channel of the input pixel to the final classifiers. Since DeepLabv3+ uses 3 channels, the results of at most 12 initial classifiers can be packed directly into a valid input slice for the final classifiers. In Fig. 3, each channel of each pixel stores 8 bits, depicted as 4 blocks of 2 bits each; the 2 bits in each block correspond to the 2-bit prediction from one of the 18 initial classifiers.
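
The packing itself can be sketched as below: each initial classifier contributes a 2-bit label per pixel, and up to four such labels fill one 8-bit channel. This is an illustrative sketch; the exact bit layout used for each final classifier follows Fig. 3.

import numpy as np

def pack_channel(preds):
    """Pack up to four 2-bit label maps (values 0-3) into one 8-bit channel;
    preds[0] occupies the most significant bits, as in Fig. 3."""
    channel = np.zeros(preds[0].shape, dtype=np.int32)
    for i, p in enumerate(preds):
        shift = 6 - 2 * i                     # bit positions 7-6, 5-4, 3-2, 1-0
        channel |= (p.astype(np.int32) & 0b11) << shift
    return channel.astype(np.uint8)

def pack_input(per_channel_preds):
    """Build a 3-channel packed image from three lists of 2-bit label maps,
    one list per channel (e.g. sagittal, coronal and axial predictions)."""
    return np.stack([pack_channel(p) for p in per_channel_preds], axis=-1)

# Whole tumor / tumor core input: three sets per channel, ranked by accuracy,
# with the least significant 2 bits of each channel left at zero.
# packed = pack_input([[rgb_s, t1ce_t1_s, flairs3_s],
#                      [rgb_c, t1ce_t1_c, flairs3_c],
#                      [rgb_a, t1ce_t1_a, flairs3_a]])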

Preparation of Packed Input for the Final Classifiers to Segment the Tumor Core and Whole Tumor: For the input to the final classifiers that detect the whole tumor and the tumor core, we use the results from 9 of the initial classifiers, and the least significant two bits of each channel are set to zero. For the whole tumor region, we rank the 6 sets of initial classifiers according to the accuracy achieved for the whole tumor region, select the top three sets and use their results to prepare the packed input for the final classifier. After packing, the bits occupied by the results of a specific set depend on its rank, so that the results from the most accurate set occupy the most significant bits. We attempted the same procedure for the tumor core, but found that although the T1CEs3 set performs better than the FLAIRs3 set for the tumor core, the packed model that uses the FLAIRs3 results shows better performance. Thus, for both the tumor core and whole tumor classifiers, we use the following 3 sets in the 3 different orientations:

  • The RGB set, which consists of the FLAIR, T1CE and T2 slices and gives the most accurate results.

  • The T1CE-T1 set, which consists of the FLAIR and T2 slices for two channels and the difference of T1CE and T1 as the third channel, and achieves rank 2 based on the accuracy of the result.

  • The FLAIRs3 set constructed using 3 consecutive FLAIR slices, which is ranked third among the three selected sets.

The results for the slices belonging to the same set but in different orientations occupy the same bit positions across all three channels. Additionally, the predictions from the classifiers that use the sagittal and coronal slices are recombined into 3-D volumes and then re-sliced axially to create slices in the axial direction. This ensures that the predictions from the three different orientations can be combined to form RGB images after packing.

Preparation of Packed Input for the Final Classifier to Segment the Enhancing Tumor: For the enhancing tumor region, we use the results from all 18 initial classifiers to create the packed input for the final classifier. The most significant 6 bits of each channel are filled with the results from the RGB, T1CE-T1, and T1CEs3 classifiers, a choice again based on the accuracy of each of the initial classifier sets for the enhancing tumor region. The RGB set achieves the highest accuracy and its results occupy the most significant 2 bits, while T1CEs3 is ranked third among the three selected sets and its results therefore occupy bits 5 and 6 of the packed input. In all these cases, the results obtained from the 3 different orientations of the same classifier set are used to create the inputs for the 3 different channels. For the 9 remaining classifiers, we use intra-set majority voting to combine the results from the 3 models in each set (one per orientation) into a single 2-bit result; this gives 2 bits each for the FLAIRs3, T2s3, and T1s3 sets. We then complete the packed input to the final classifier by placing the majority-voted results from FLAIRs3, T2s3, and T1s3 in the least significant 2 bits of channels 1, 2, and 3, respectively.

Our simple majority voting scheme works by first recombining all the predicted slices in the axial direction to obtain a 3-D brain segmentation; we perform the same operation along the other axes to reconstruct the segmented volumes from the sagittal and coronal slices. We denote the prediction for the voxel at position \((i,j,k)\) obtained from the sagittal slices as \(p^{sagittal}_{i,j,k}\), from the axial slices as \(p^{axial}_{i,j,k}\), and from the coronal slices as \(p^{coronal}_{i,j,k}\). If, for any voxel, any two of these predictions agree, that is, \(p^{sagittal}_{i,j,k} = p^{coronal}_{i,j,k}\), \(p^{sagittal}_{i,j,k} = p^{axial}_{i,j,k}\) or \(p^{axial}_{i,j,k} = p^{coronal}_{i,j,k}\), then we assign the majority label to that voxel. If all three predictions differ, we give priority to the prediction from the orientation that achieves the best performance. From our experiments, we found that, apart from the classifiers that use the T1CE-T1 slices, the highest average Dice coefficient is achieved by the model that uses the cross-sectional slices.
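
A sketch of this voxel-wise voting is shown below. It assumes the three prediction volumes have already been reassembled into a common (sagittal, coronal, axial) index order; the function name and the default priority orientation are illustrative.

import numpy as np

def majority_vote(p_sagittal, p_coronal, p_axial, priority="axial"):
    """Voxel-wise majority vote over three aligned prediction volumes.
    Where all three labels differ, fall back to the orientation named in
    `priority`."""
    volumes = {"sagittal": p_sagittal, "coronal": p_coronal, "axial": p_axial}
    out = volumes[priority].copy()          # default: best-performing orientation
    # Wherever any two orientations agree, the agreeing label wins.
    agree_ca = p_coronal == p_axial
    agree_sa = p_sagittal == p_axial
    agree_sc = p_sagittal == p_coronal
    out[agree_ca] = p_coronal[agree_ca]
    out[agree_sa] = p_sagittal[agree_sa]
    out[agree_sc] = p_sagittal[agree_sc]
    return out

# Slices predicted in the coronal direction, for instance, can first be
# reassembled into a volume with np.stack(coronal_slices, axis=1) so that all
# three volumes share the same (sagittal, coronal, axial) index order.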

Training and Generation of the Final Prediction Using the 3 Final Classifiers: Using the above inputs, we train DeepLabv3+ models for 150000 iterations and check the performance on the validation set after every 5000 iterations. For the model that identifies the enhancing tumor section, we get the best performance after training for 115000 iterations. While the inputs for the whole tumor and tumor core models are the same, we achieve the best performance for the whole tumor after training for 100000 iterations, and for the tumor core after training for 130000 iterations. All other training parameters are the same as those used for the initial 18 classifiers. For the testing phase, we save and use the best performing checkpoints to perform our predictions.

Finally, we combine the outputs of the final 3 classifiers using the fact that the tumor core is a subset of the whole tumor and the enhancing tumor is a subset of the tumor core. Thus, when combining the results, any pixel that is predicted as tumor core but lies outside the whole tumor mask is not considered to belong to the tumor core. We use similar logic to assign the labels for the enhancing tumor region.
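
A minimal sketch of this final combination is given below. It assumes the three final classifiers yield binary masks of the same shape, and it writes the result back in the BraTS label convention (1 for NCR+NET, 2 for edema, 4 for enhancing tumor); the function is our illustration of the nesting constraint, not the exact combiner code.

import numpy as np

def combine_regions(whole, core, enhancing):
    """Combine binary masks for the whole tumor (WT), tumor core (TC) and
    enhancing tumor (ET) into a label map, enforcing ET <= TC <= WT."""
    whole = whole.astype(bool)
    core = core.astype(bool) & whole           # tumor core must lie inside WT
    enhancing = enhancing.astype(bool) & core  # enhancing tumor must lie inside TC
    labels = np.zeros(whole.shape, dtype=np.uint8)
    labels[whole] = 2                          # edema: WT voxels outside TC
    labels[core] = 1                           # NCR+NET: TC voxels outside ET
    labels[enhancing] = 4                      # enhancing tumor
    return labels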

Our code can be found at: https://ar16m@bitbucket.org/ar16m/brats.

4 Results

We evaluate the performance of our network by training the 18 initial classifiers, consisting of 6 sets of 3 classifiers each, and reporting their performance on the validation set, which consists of 66 cases. The models are used to perform predictions on the validation slices of the corresponding orientations. We report the Dice coefficient of each trained model in a separate table for each of the 3 orientations. We also report the Dice coefficient obtained for each set after using majority voting to combine the results from the 3 orientations.

In Tables 1, 2, 3 and 4, we provide the mean Dice coefficient achieved for the three classes: whole tumor, tumor core, and enhancing tumor.

Table 1. Dice coefficient achieved by the 6 models trained on the sagittal slices
Table 2. Dice coefficient achieved by the 6 models trained on the coronal slices
Table 3. Dice coefficient achieved by the 6 models trained on the axial slices
Table 4. Dice coefficient achieved by the 6 majority-voted predictions (one corresponding to each set)

From the above results, we see that the basic majority voting-based combiner improves the Dice coefficient. Thus, in the packed input used by the final classifier for the enhancing tumor region, we use the majority-voted results for the 3 lowest ranked sets. In this way, we reduce the number of bits occupied by the results from the classifiers that perform poorly, ensure that we can pack the results of the 18 initial classifiers into the \(8 \times 3\) bits available for each pixel, and supply more accurate results to the final classifier.

Finally, we provide some measures of the accuracy achieved by combining the results from our final 3 classifiers. In Table 5, we first provide the mean Dice coefficient achieved by the 3 final classifiers for the corresponding regions; we then provide the Dice coefficient, sensitivity, specificity and Hausdorff distance achieved by the combined prediction on the validation data, as well as the Dice coefficient and Hausdorff distance for the testing data.

Table 5. Performance of our combined model using four common metrics.

We also provide a sample image to visualize our segmentation results after combining the outputs of the three final models. Figure 4 depicts the T1, FLAIR, T1CE and T2 images as well as our predicted segmentation and the ground truth segmentation. The figure shows the level of accuracy that we achieve after combining the results from the three final models; however, in some cases, uneven tumor boundaries are not identified accurately.

Fig. 4. Visualization of segmentation: the figure depicts the T1, FLAIR, T1CE and T2 images as well as the ground truth and predicted segmentations.

5 Conclusion and Future Work

We used the DeepLabv3+ deep learning framework to perform brain tumor segmentation. We trained our network separately on the coronal, axial and sagittal slices and developed a pipeline consisting of 18 initial classifiers, divided into 6 sets of 3 classifiers each (one per orientation). Each set uses a different combination of input slices, and the outputs of these 18 classifiers are combined using bit packing to prepare the inputs for the final 3 classifiers, one for each of the three target regions. The outputs of these 3 classifiers are combined to produce the final segmented brain images. On the testing data, we achieve Dice coefficients of 0.7086, 0.7897 and 0.8755 for the enhancing tumor, tumor core and whole tumor regions, respectively. A summary of the BraTS 2018 challenge can be found in [5].

In the future, we intend to extend our model to use patches of at least 5 consecutive slices to segment the central slice. This neighborhood information should enable our model to achieve higher accuracy because, in general, if there is tumor in the neighboring slices, the slice under consideration is more likely to contain tumor. This information is not exploited when segmentation is performed on individual slices, and it may be insufficient when only the two immediately neighboring slices are used.