Keywords

1 Introduction

Assessing per-case image segmentation quality is an important issue when researchers want to develop a computer-aided diagnosis (CADx) system or integrate automated image analysis methods into large-scale medical studies. Since image segmentation usually serves as a low-level module in a CADx system or a clinical study pipeline, errors incurred by segmentation algorithms will be delivered or even amplified in the subsequent calculation of image-based measurements and other downstream procedures, which may result in misleading statistical conclusions or slow down the diagnosis process. Automatic quality assessment is an appealing solution to such problems, which, ideally, should not only report the per-case segmentation quality, but also highlight those misclassified pixels, so that doctors (or an automatic system) can easily verify the reliability of a segmentation result and decide whether to keep it for further analysis. Besides, an automatic pixel-wise error prediction algorithm has a great potential in medical training, where it can provide inexperienced students with fine-grain feedback by pointing out which pixels are mislabeled.

Although the quality assessment for segmentations can be done by simply comparing with experts’ annotation, this method is too costly to be applied in large-scale studies or automated pipelines. In natural image analysis area, there have been a number of unsupervised image segmentation evaluation methods [7], which employ low-level features, e.g., color error, texture, entropy, and their combinations to measure a segmentation’s visual consistency with human observers. However, the application of such methods in medical area remains unclear [6]. Reserve validation method [8] trains classifiers with pseudo ground truth to quantify how well a classifier performs on a target domain, but it can only give a single quality measurement for a classifier across the whole test set, which cannot meet the per-case demand of a quality assessment algorithm. Recently, Valindria et al. proposed reverse classification accuracy (RCA) [5, 6] method which is able to evaluate the quality for each single case, but this method has high computational cost and cannot predict a fine-grain error map. Robinson et al. developed a deep learning model to directly regress the Dice Similarity Coefficient (DSC) of a segmentation mask, which is much faster but still can only provide an image-level measurement and requires a large-scale training set.

In this study, we build an automatic quality assessment framework that is capable of simultaneous pixel-wise and per-case evaluation for segmentation masks. Specifically, we first formally define the pixel-wise error map prediction problem, and then show the capacity of a modern deep learning (DL) model in predicting error maps for auto-generated segmentation masks. We also derive a quality indicator (QI) from the output error maps, which can measure segmentation quality in a per-case manner. To generate diverse and representative segmentation masks, we train a VoxResNet [1] and its 2D version on the training sets, and collect all their side (induced by the deep supervision [3] paths) and final outputs as sample segmentations. To demonstrate the efficacy of our method, we evaluate it on a public 3D whole-heart segmentation dataset, i.e., MICCAI 2017 MMWHS. The quality and quantity results of a 5-fold cross validation show that our framework is able to identify the misclassified pixels in an input mask with satisfiable accuracy. We also observe a strong correlation between QI and the actual segmentation accuracy (Acc), as well as between QI and DSC score, evidencing the capacity of our framework working as an image-level segmentation quality evaluator. To the best of our knowledge, this is the first time that the segmentation quality assessment problem is addressed in a pixel-wise manner for medical images, and we are also a pioneer who manages to predict per-case segmentation quality accurately only based on relatively small training sets (e.g., 16 training MRI scans in each fold).

2 Method

Our method is mainly composed of a mask generation part (segmentor) and an error map prediction part (error map predictor), as shown in Fig. 1(a). In this section, we will first define the error map prediction problem, then elaborate the mask generation and error map prediction methods, and finally detail the training of the proposed framework.

Fig. 1.
figure 1

Flowchart of the proposed error map prediction framework and the DL model architecture (VoxResNet) employed in this paper.

2.1 Formulation of Error Map Prediction Problem

We define the error map \(\mathcal {E}\) of a segmentation mask S as

$$\begin{aligned} \mathcal {E}(i) = {\left\{ \begin{array}{ll}1, &{}S(i)\ne GT(i), \\ 0, &{}S(i)=GT(i),\end{array}\right. } \end{aligned}$$
(1)

where GT is the ground truth segmentation and i specifies the pixel (voxel) location. \(S(i), GT(i)\in \{0,1,\cdots ,C\}\), where C is the number of foreground classes and 0 denotes the background class. When ground truth segmentation is not available, we build a model M that is parameterized by \(\theta \) to estimate the error map:

$$\begin{aligned} \widehat{\mathcal {E}} = M(I,S;\theta ), \end{aligned}$$
(2)

where I denotes the original image, \(\widehat{\mathcal {E}}\) is the predicted error map for the segmentation mask S. Thus, given a dataset \(\mathcal {D}=\{I_i,\{S_i^k\}_{k=1}^{m},GT_i\}_{i=1}^N\), the error map prediction problem can be formulated as an optimization task over model parameter \(\theta \):

$$\begin{aligned} \min _\theta \frac{1}{mN}\sum _{i=1}^{N}\sum _{k=1}^{m}d\left( \mathcal {E}(S_i^k,GT_i), \widehat{\mathcal {E}}(I_i,S_i^k;\theta )\right) , \end{aligned}$$
(3)

where N is the number of images, m is the number of generated segmentation masks for each image, \(d(\cdot ,\cdot )\) is a distance metric such as cross entropy, which measures the difference between the true and the predicted error maps. An example error map is shown in Fig. 2(a).

Fig. 2.
figure 2

(a) An example error map, where (i), (ii), (iii) and (iv) are the original image, auto-generated mask, ground truth and error map, respectively. Note that the intensity of the error map is reversed (0 for error and 1 for correct pixels) for better visualization. (b) Histogram of DSC score of the generated segmentation masks.

2.2 Mask Generation

To enable the training and evaluation of the error map predictor, we train two different CNN models on the training sets and collect all their outputs to form a mask set \(\{S_i^k\}_{i,k}\). One segmentation model is VoxResNet [1], a representative and state-of-the-art CNN model designed for volumetric medical image segmentation tasks, which leverages residual connections [2] and combines multi-scale features to make a quality prediction, as schematically illustrated in Fig. 1(b). To encourage the diversity of collected masks, we also employ a 2D version of VoxResNet to perform the generation work, as it is believed that the outputs of 2D and 3D models can be significantly different since their receptive fields are distinctive. Further, the side outputs of these models, which are generated by the deep supervision pathways (corresponding to Loss-2 to Loss-5 in Fig. 1(b)), are also added to the mask collection to involve more examples with various segmentation qualities. Overall, the number of auto-generated masks for each scan is \(m= 2\times (4 + 1) = 10\), where 4 is the number of side outputs in a segmentor. The DSC score histogram of the generated segmentations is shown in Fig. 2(b).

2.3 Error Map Predictor and Quality Indicator

Since an error map has the same size as the corresponding segmentation mask and takes 0–1 values, the error map predictor can be implemented by a binary segmentation model. Without loss of generality, we employ another VoxResNet to carry out the prediction, which has the same architecture as the 3D segmentor but different input and output channels. As shown in Fig. 1(a), the error map predictor takes the concatenation of the segmentation mask (with one-hot coding) and the original image as inputs, and output a probabilistic map indicating the probability of each pixel being misclassified by the segmentor (then we can get the binary error map by thresholding). Since the mean value across a true error map is exactly the segmentation accuracy, we derive a quality indicator (QI) by averaging the predicted binary error map to measure the overall quality of the input mask (denoted by the QI node following the predicted error map in Fig. 1(a)). The training procedure for the predictor is similar to a standard segmentation model, with the difference that we generate its input masks on the fly by the segmentors (with fixed parameters) to save RAM space.

2.4 Training Details

For the training of the segmentors, we optimize the summation of a cross entropy (CE) loss and a multi-class Dice loss on the training set with a standard segmentation pipeline. Then we apply the trained segmentors to both training and test sets, and collect their outputs on the training set for the subsequent training of the predictor, and save the outputs on the test set for evaluation of the error map predictor. In this way, both the segmentors and predictor can only access the training set during the learning phase, and their performance are evaluated solely on the test set. Besides, since the predictor is essentially a binary segmentation model, we optimize it by the summation of a CE loss and a binary Dice loss (error pixels as the positive class).

3 Experiments and Results

Dataset. We evaluated our framework on a public whole-heart segmentation dataset, i.e., MICCAI 2017 MMWHSFootnote 1 [9]. We employed the 20 MRI scans in this dataset which are paired with manual annotations with 7 foreground classes. For preprocessing, we resampled each scans to an isotropic voxel resolution of \(2\times 2\times 2\) mm, normalized their intensity to \([-1,1]\) and then extracted the heart region with a method similar to [4]. The preprocessed scans have a size of around \(120\times 120\times 90\). Standard data augmentation methods are applied during the training of segmentors and the predictor, including random cropping (to a patch size of \(96\times 96\times 80\), or \(96\times 96\) for 2D), flipping along each axis and scaling. For all experiments, we run 5-fold cross validations since the dataset is relatively small.

Metrics. As the error map prediction problem is essentially a segmentation task, we employ common segmentation metrics to measure its performance, which include Dice similarity coefficient (DSC), accuracy (Acc), precision (Prec) and recall (Recl). Note that in the metric calculation we regard the error pixels as the positive class, though in the figures we show error pixels by low intensity for the sake of clear visualization.

Table 1. Error map prediction performance

Error Map Predictor. To thoroughly investigate the performance of the error map predictor, we evaluate it on different kinds of segmentation masks and report their performance in Table 1. The predictor’s performance on the final outputs of the 3D VoxResNet is tagged by ‘3D-Final’, and its side outputs are tagged by ‘3D-2’ to ‘3D-5’, respectively. Ground truth masks are referred to as ‘GT’. We omit the detailed mask categories of 2D generators as they are similar to the 3D case, and only report their average values to save space. The last two rows show the predictor’s overall performance on all masks with or without GT, respectively. We also list segmentation metrics in the table (last two columns) in order for better interpretation of the prediction results. Considering both prediction and segmentation metrics, we find that the error map predictor performs worse on those masks with better quality, and vice versa. This is because, for masks with good quality, the error regions are usually small (or thin) and near to the class boundaries (as the first case shown in Fig. 3), where it is hard to tell which pixel is wrong if GT is not available. On the other hand, low-quality masks tend to have larger error regions that do not concentrate on subtle boundaries, which can be easily identified by our error prediction model (an example is illustrated in the second row of Fig. 3). Another observation is that our error map predictor performs well for GT masks, on which the it achieves the highest prediction accuracy, i.e., 0.9967, meaning that our model can ‘feel’ that a GT mask is of high quality, even though it has never seen this GT mask before. Note that the prediction DSC for GT masks is low because the true error map for a GT mask is all zero, such that even a single wrong pixel will lead to a zero DSC score. Overall, our error map predictor achieves a good DSC of 0.626 (0.592 if considering GT masks), demonstrating the efficacy of the proposed error map prediction model. Qualitative results can be found in Fig. 3, where several representative error map predictions are present.

Fig. 3.
figure 3

Representative error map predictions. The last two columns show the raw (probabilistic) predicted error maps and the thresholded binary maps, respectively.

Fig. 4.
figure 4

Scatter plots of (a) Real Acc and QI, and (b) Real DSC and QI. Different colors denotes different mask types (0: GT; −1: final output; −2 to −5: side outputs).

Quality Indicator. As mentioned in Sect. 2.3, we derive a QI by averaging the predicted error map. To measure how well QI can represent the segmentation quality, we compute the Pearson correlation coefficient (PCC), mean absolute error (MAE) between QI and real segmentation accuracy, as well as the PCC between QI and real DSC using all 220 masks, and the results are as follows:

$$\begin{aligned} PCC_{QI,Acc} = 0.972,\,PCC_{QI, DSC} = 0.856, \, MAE_{QI, Acc} = 0.0048, \end{aligned}$$
(4)

where both correlations are significant with \(p<0.0001\). We also draw two scatter plots (QI-Acc and QI-DSC) in Fig. 4 to visualize the relationship between QI and the segmentation measurements. Considering the high PCC and low MAE between QI and Acc, as well as the strong linear relationship observed in Fig. 4(a), QI can be regarded as a precise approximator to the real segmentation accuracy, thus can work as a good segmentation quality measurement.

4 Conclusion

Per-case and fine-grain segmentation quality assessment plays a crucial role in automating the image-based pipelines in medical research or clinical diagnosis, but this area has not yet been fully studied. This work formally defines a fine-grain error map prediction problem, and attempts to address it using a DL framework. The evaluation results on a public whole-heart segmentation dataset demonstrates the efficacy of our error map predictor, and also shows that a per-case quality measurement can be derived from the predicted error maps, which approximates the real segmentation accuracy well with a small MAE. Future work will investigate the potential of error map predictors in improving segmentor’s robustness, where an error map predictor can serve as a critic to regularize the segmentation model. The proposed framework is inherently generic, in which the segmentors and predictors can be replaced with different types of models (e.g., random forest), and it can be adapted to other segmentation applications easily.