
1 Introduction

Ischemic stroke is one of the most prevalent causes of death and disability worldwide [2]. It is caused by an occlusion of the cerebral arteries, which results in hypoxia and, finally, in death of the affected brain tissue. Brain imaging plays a crucial role in the diagnosis of ischemic stroke and the corresponding therapeutic decisions. However, detection and evaluation of stroke lesions demand a significant amount of the radiologist's time. To overcome, or at least alleviate, this problem, reliable methods to automatically segment the lesions are strongly desired.

The Ischemic Stroke Lesion Segmentation (ISLES) challenge [3] strives to provide open benchmark data to objectively evaluate and compare respective algorithms. The first ISLES challenges and evaluation data sets focused on magnetic resonance imaging (MRI) data [1, 4]. Compared to computed tomography (CT), MRI offers better visibility of stroke lesions and the absence of ionizing radiation; CT, however, is more widely available in most hospitals [5]. Given the clinical relevance of (perfusion) CT for triaging stroke patients, the 2018 ISLES challenge therefore focuses on methods to segment acute ischemic stroke lesions in native CT images and contrast medium-enhanced perfusion maps. The open question is whether and to what extent the 2015–2017 methods can be adapted to core lesion segmentation using such data. To this end, the challenge organizers provided a training data set including radiologists' ground-truth segmentations and a testing data set without known ground truth; participants were asked to upload their stroke segmentation results for the testing data set.

In this article, we describe our participation in the ISLES 2018 challenge. The final choice of the applied method(s) was informed by the original MRI-based 2015 ISLES challenge [1]. At that time, the majority of participants applied random forests for segmentation, either as a single solution or as part of a pipeline [6,7,8,9,10,11,12,13,14,15]. In addition, the first convolutional neural network (CNN)-based approaches were applied [16, 17], and encoder-decoder-style CNNs [18] have since become a quasi-standard for medical image segmentation. We therefore decided to implement both approaches and to report on their performance on the ISLES 2018 training data set, applied both individually and in combination. Being interested in the pure algorithm performance, we neither made use of data beyond the provided challenge training data nor performed extensive data augmentation. Algorithmic details are described in depth in the following sections.

2 Materials and Methods

2.1 Image Data Description

The ISLES 2018 training data contained 94 individual data sets (subsequently called 'cases') from 63 patients, i.e. image data from patients with spatially separated lesions were split into different cases. The individual cases consisted only of slices around the stroke lesions, with the number of slices per case ranging from 2 to 16. For each case, native CT image data as well as contrast medium-enhanced 4D CT image data were provided. From the latter, cerebral blood flow (CBF), cerebral blood volume (CBV), mean transit time (MTT), and time-to-maximum flow (T\(_{\max }\)) perfusion maps were calculated and also provided.

Ground truth data were binary masks manually drawn by radiologists on the basis of co-registered diffusion weighted imaging (DWI)-MRI data.

The DWI-MRI data were not available for the ISLES 2018 challenge testing data; we therefore did not consider them for training purposes or in the experiments described in the present paper. As dealing with the 4D perfusion image series is also challenging from a computational perspective (e.g. GPU memory handling), we did not make use of the explicit temporal information; instead, we only computed a corresponding maximum intensity projection (MIP) along the temporal axis as additional image data.
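As a minimal illustration, assuming the 4D perfusion series is available as a NumPy array with the temporal dimension last, the MIP reduces to a single reduction call:

```python
import numpy as np

def temporal_mip(perfusion_4d: np.ndarray) -> np.ndarray:
    """Maximum intensity projection along the temporal axis.

    Assumes the 4D CT perfusion series has shape (x, y, z, t);
    the result is a 3D volume of shape (x, y, z).
    """
    return perfusion_4d.max(axis=-1)
```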

2.2 Image Preprocessing

Image preprocessing consisted of two parts: intensity normalization and fast brain segmentation. Image intensity normalization was applied independently for each case and image sequence by rescaling the original image intensity dynamics to the range [0; 1]. Fast brain segmentation was based on a 3D connected component (CC) analysis of the CBV images. The largest non-zero CC was considered an initial brain segmentation, which was further refined for the individual axial slices by filtering out small 2D CCs based on their size and position (the CC centroid should lie posterior rather than anterior; anterior CCs corresponded, e.g., to close-to-the-eyes components in inferior slices).
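A minimal sketch of these two preprocessing steps, using NumPy and SciPy (the slice-wise 2D refinement is omitted for brevity; function names are illustrative):

```python
import numpy as np
from scipy import ndimage

def rescale_intensity(img: np.ndarray) -> np.ndarray:
    """Rescale the intensities of one case/sequence to [0, 1]."""
    lo, hi = float(img.min()), float(img.max())
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img, dtype=float)

def initial_brain_mask(cbv: np.ndarray) -> np.ndarray:
    """Largest non-zero 3D connected component of the CBV map."""
    labels, n_cc = ndimage.label(cbv > 0)
    if n_cc == 0:
        return np.zeros(cbv.shape, dtype=bool)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # ignore the background label
    return labels == sizes.argmax()
```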

2.3 Random Forest (RF) Implementation

For random forest (RF) implementation, we used the Scikit-learn Python library [19], Gini impurity as impurity measure, and 150 trees. For RF feature extraction, we applied the Python medical image processing library MedPy [20]. Features were computed for each case and voxel inside the case-specific brain mask (see Sect. 2.2); class imbalance due to small lesions (compared to healthy brain tissue) was not accounted for.
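A sketch of the corresponding classifier setup; `train_features` (one row of 37 features per voxel, see below), `train_labels`, and `test_features` are hypothetical names, and all hyperparameters other than those stated above are left at Scikit-learn defaults:

```python
from sklearn.ensemble import RandomForestClassifier

# 150 trees with Gini impurity, as stated above; the remaining
# settings are assumed library defaults.
rf = RandomForestClassifier(n_estimators=150, criterion="gini", n_jobs=-1)
rf.fit(train_features, train_labels)

# Voxel-wise probability of the lesion class, used for the
# probability maps described below.
lesion_prob = rf.predict_proba(test_features)[:, 1]
```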

As voxel features, we first used the image intensity as well as Gaussian-weighted local mean intensity values (standard deviations \(\sigma = 3,7,13\)). These intensity features were computed for each of the rescaled image sequences (i.e. native CT, CBF, CBV, MTT, and T\(_{\max }\)), resulting in 20 features. Second, although the MedPy implementation for deriving hemispheric difference values may yield partially erroneous values when the images are not normalized/registered to a symmetric atlas space, we also computed voxel-wise hemispheric difference values for the different image sequences, yielding five additional features. Third, we extracted a 10-bin local histogram (\(5\times 5\) neighborhood) from the T\(_{\max }\) map and used the ten frequency values as additional features. Fourth and finally, we computed the distance of each voxel to the brain mask boundary and its distance to the image center. In total, we thus extracted 37 features per voxel.
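For illustration, a SciPy-based sketch of two of the four feature groups (the paper itself uses MedPy [20]; hemispheric differences and local histograms are omitted here):

```python
import numpy as np
from scipy import ndimage

def intensity_features(img):
    """Raw intensity plus Gaussian-weighted local means (sigma = 3, 7, 13)."""
    return [img] + [ndimage.gaussian_filter(img, sigma=s) for s in (3, 7, 13)]

def distance_features(brain_mask):
    """Distance to the brain mask boundary and to the image center."""
    d_boundary = ndimage.distance_transform_edt(brain_mask)
    grid = np.indices(brain_mask.shape).astype(float)
    center = (np.array(brain_mask.shape, dtype=float) - 1.0) / 2.0
    d_center = np.linalg.norm(
        grid - center.reshape(-1, *([1] * brain_mask.ndim)), axis=0)
    return [d_boundary, d_center]
```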

For classification of test cases, i.e. application of the trained RF, case-specific RF probability maps were generated based on the voxel-wise probability of belonging to the lesion class. Each probability map was rescaled to the range [0; 1] such that values at or above the 99th percentile of the map were mapped to 1. For lesion segmentation, the resulting map was thresholded at 0.5. Given the resulting binary map, we excluded all connected areas smaller than 70 voxels to avoid false positives related to noise or image artifacts. Finally, a hole-filling algorithm was applied to each axial slice.
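A sketch of this post-processing chain, assuming the axial slices are stacked along the last array axis:

```python
import numpy as np
from scipy import ndimage

def postprocess_probability_map(prob, min_cc_size=70):
    """Rescale by the 99th percentile, threshold at 0.5, remove small
    connected components, and fill holes in each axial slice."""
    p99 = np.percentile(prob, 99)
    prob = np.clip(prob / p99, 0.0, 1.0) if p99 > 0 else prob
    seg = prob > 0.5
    labels, _ = ndimage.label(seg)
    sizes = np.bincount(labels.ravel())
    keep = np.flatnonzero(sizes >= min_cc_size)
    keep = keep[keep != 0]  # never keep the background label
    seg = np.isin(labels, keep)
    for z in range(seg.shape[-1]):  # slice-wise hole filling
        seg[..., z] = ndimage.binary_fill_holes(seg[..., z])
    return seg
```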

2.4 CNN Architecture and Workflow

Different CNN architectures were proposed in recent years for semantic image segmentation via deep learning. In this study, we decided to employ the currently popular DeepLabv3+ network, which is designed to learn multi-scale contextual image features while controlling signal decimation [21]. For this purpose, it utilizes three key concepts:

  • Feature extraction. Feature extraction is conducted by a ResNet, pre-trained on the ImageNet data set, which employs residual learning to allow for efficient training of deeper neural networks [22].

  • Atrous convolutions. Atrous convolutional layers enable efficient computation of feature responses with larger context information but without increasing the number of parameters, i.e. they use sparse (dilated) convolution kernels (see the sketch after this list).

  • Atrous spatial pyramid pooling. To combine the multi-scale information yielded by the different atrous convolution layers, atrous spatial pyramid pooling is applied.
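To illustrate the second concept, a minimal PyTorch example (the framework is our assumption; the paper does not state it): a 3×3 kernel with dilation rate 2 covers a 5×5 receptive field with only nine weights per channel pair.

```python
import torch
import torch.nn as nn

# Dilation 2 enlarges the receptive field of a 3x3 kernel to 5x5
# without adding parameters; padding=2 preserves the spatial size.
atrous = nn.Conv2d(in_channels=6, out_channels=64,
                   kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 6, 64, 64)  # one 64x64 patch with 6 channels
print(atrous(x).shape)         # torch.Size([1, 64, 64, 64])
```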

For the problem defined by the ISLES 2018 challenge, the original DeepLabv3+ structure had to be slightly modified. To convert the network output into a binary segmentation, a single filter of the last convolutional layer was randomly chosen from its 21 pre-trained filters. Further, we had to change the number of input color channels from 3 to 6, as we concatenated the given native CT, CBF, CBV, MTT, and T\(_{\max }\) data and the computed MIP image data, yielding a tensor of shape \(n_x \times n_y \times 6\) for every case and slice. To allow for robust and efficient training, 256 patches of shape \(64\times 64\times 6\) were randomly selected from each \((n_x,n_y)\)-slice and used as input data. Based on these data, the model was re-trained for 5 epochs.
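A sketch of the per-slice input assembly and random patch sampling (names are illustrative; slices are assumed to be at least 64 × 64 pixels):

```python
import numpy as np

rng = np.random.default_rng()

def sample_patches(slice_stack, n_patches=256, size=64):
    """Randomly crop n_patches patches of shape (size, size, 6) from one
    (n_x, n_y, 6) slice stacking CT, CBF, CBV, MTT, Tmax, and MIP."""
    n_x, n_y, _ = slice_stack.shape
    xs = rng.integers(0, n_x - size + 1, n_patches)
    ys = rng.integers(0, n_y - size + 1, n_patches)
    return np.stack([slice_stack[x:x + size, y:y + size, :]
                     for x, y in zip(xs, ys)])
```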

2.5 Combining RF and CNN Results

To analyze potential segmentation improvement by combining RF and CNN outputs, we averaged the 99th-percentile-rescaled RF probability map and the CNN probability map. Thresholding and post-processing of the resulting binary maps remained unchanged compared to the previous descriptions.
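In code, the combination is a simple voxel-wise average of the two maps, reusing the post-processing sketch from Sect. 2.3 (variable names are illustrative):

```python
combined_prob = 0.5 * (rf_prob_rescaled + cnn_prob)
combined_seg = postprocess_probability_map(combined_prob)
```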

2.6 Experiments

To evaluate the implemented stroke segmentation approaches, we performed a 5-fold cross-validation based on the underlying patient collective of the ISLES training data set. Splitting the 63 patients instead of the 94 cases ensured that no patient was part of both the training and the testing data, thereby preventing a related bias and an overestimation of the segmentation performance.
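Such a patient-level split can be realized, for instance, with Scikit-learn's GroupKFold; `cases` and `patient_ids` (one patient identifier per case) are hypothetical names:

```python
from sklearn.model_selection import GroupKFold

# Grouping the 94 cases by their 63 patients keeps all cases of one
# patient on the same side of each split.
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(cases, groups=patient_ids)):
    ...  # train on cases[train_idx], evaluate on cases[test_idx]
```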

The same folds were used for RF-only training and evaluation, CNN-only training and evaluation, and evaluation of the combination of RF and CNN output. Following previous ISLES contributions, we focused our evaluation on three metrics: Dice coefficient, Hausdorff distance (HD), and average symmetric surface distance (ASSD).
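As MedPy [20] is already part of our pipeline, its binary metrics module provides all three measures directly; `seg` and `gt` denote a predicted and a ground-truth binary mask (a sketch; with the default voxel spacing of 1, HD and ASSD are reported in pixel units, cf. Table 1):

```python
from medpy.metric.binary import assd, dc, hd

dice = dc(seg, gt)           # Dice coefficient
hausdorff = hd(seg, gt)      # Hausdorff distance (pixel units)
avg_surface = assd(seg, gt)  # average symmetric surface distance
```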

Table 1. RF-only, CNN-only, and combined results (mean ± standard deviation over cases) for the individual folds, together with the corresponding averages. HD and ASSD values refer to multiples of the in-plane pixel side length.

3 Results

The results for RF-only, CNN-only, and combined RF and CNN stroke lesion segmentation are summarized in Table 1. RF-only segmentation resulted in a mean Dice coefficient of \(0.42\pm 0.26\) (averaged over all testing cases and folds). The average CNN-only Dice coefficient was slightly higher (\(0.44\pm 0.27\)); however, this was not consistently observed for all folds. In contrast, the Dice coefficient after combining RF and CNN outputs was for all folds at least as high as the higher single-approach value, indicating the potential of improving segmentation quality by combining the two approaches. Similar observations hold true for the Hausdorff distance values. ASSD was, in turn, on average lower for CNN-only than for RF-only and the combined approach. An illustration of the complementary information of the individual approaches can be found in Fig. 1.

Fig. 1. Illustration of the complementary information provided by RF-only and CNN-only stroke lesion segmentation, and of the potential of combining the respective outputs. All images are shown for case 40 and the fifth fold. Top row, left: probability map (blue: low probability; red: max. values) and ground truth segmentation contours for RF-only (Dice 0.44); top row, right: the same information for CNN-only (Dice 0.31). Bottom row, left: combined probability map (Dice 0.66); bottom row, right: ground truth segmentation. (Color figure online)

4 Discussion and Conclusion

We applied and evaluated a classical random forest classifier employing hand-crafted image features commonly used in the context of stroke lesion segmentation [1], a state-of-the-art CNN architecture for image segmentation, and a combination of the two algorithms. Given the still relatively small amount of training data, especially for deep learning purposes, the underlying hypothesis was that the two methodically different approaches extract complementary information and that, as a consequence, combining them improves segmentation performance. The presented results (especially in terms of the Dice coefficient) support this hypothesis. Further improvements could, for instance, be reached by using additional image data during training, by exploiting data augmentation strategies, and by extensive use of additional methods (e.g. larger ensembles); nevertheless, the presented baseline performance is considered promising.