A Fine-Grain Error Map Prediction and Segmentation Quality Assessment Framework for Whole-Heart Segmentation

Zhang, Rongzhao; Chung, Albert C. S.

doi:10.1007/978-3-030-32245-8_61


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11765)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

When introducing advanced image computing algorithms, e.g., whole-heart segmentation, into clinical practice, a common suspicion is how reliable the automatically computed results are. In fact, it is important to find out the failure cases and identify the misclassified pixels so that they can be excluded or corrected for the subsequent analysis or diagnosis. However, it is not a trivial problem to predict the errors in a segmentation mask when ground truth (usually annotated by experts) is absent. In this work, we attempt to address the pixel-wise error map prediction problem and the per-case mask quality assessment problem using a unified deep learning (DL) framework. Specifically, we first formalize an error map prediction problem, then we convert it to a segmentation problem and build a DL network to tackle it. We also derive a quality indicator (QI) from a predicted error map to measure the overall quality of a segmentation mask. To evaluate the proposed framework, we perform extensive experiments on a public whole-heart segmentation dataset, i.e., MICCAI 2017 MMWHS. By 5-fold cross validation, we obtain an overall Dice score of 0.626 for the error map prediction task, and observe a high Pearson correlation coefficient (PCC) of 0.972 between QI and the actual segmentation accuracy (Acc), as well as a low mean absolute error (MAE) of 0.0048 between them, which evidences the efficacy of our method in both error map prediction and quality assessment.

1 Introduction

Assessing per-case image segmentation quality is an important issue when researchers want to develop a computer-aided diagnosis (CADx) system or integrate automated image analysis methods into large-scale medical studies. Since image segmentation usually serves as a low-level module in a CADx system or a clinical study pipeline, errors incurred by segmentation algorithms will be delivered or even amplified in the subsequent calculation of image-based measurements and other downstream procedures, which may result in misleading statistical conclusions or slow down the diagnosis process. Automatic quality assessment is an appealing solution to such problems, which, ideally, should not only report the per-case segmentation quality, but also highlight those misclassified pixels, so that doctors (or an automatic system) can easily verify the reliability of a segmentation result and decide whether to keep it for further analysis. Besides, an automatic pixel-wise error prediction algorithm has a great potential in medical training, where it can provide inexperienced students with fine-grain feedback by pointing out which pixels are mislabeled.

Although the quality of a segmentation can be assessed by simply comparing it with experts’ annotations, this approach is too costly to be applied in large-scale studies or automated pipelines. In the natural image analysis area, there have been a number of unsupervised image segmentation evaluation methods [7], which employ low-level features, e.g., color error, texture, entropy, and their combinations to measure a segmentation’s visual consistency with human observers. However, the applicability of such methods in the medical area remains unclear [6]. The reverse validation method [8] trains classifiers with pseudo ground truth to quantify how well a classifier performs on a target domain, but it can only give a single quality measurement for a classifier across the whole test set, which cannot meet the per-case demand of a quality assessment algorithm. Recently, Valindria et al. proposed the reverse classification accuracy (RCA) method [5, 6], which is able to evaluate the quality of each single case, but this method has a high computational cost and cannot predict a fine-grain error map. Robinson et al. developed a deep learning model to directly regress the Dice Similarity Coefficient (DSC) of a segmentation mask, which is much faster but still provides only an image-level measurement and requires a large-scale training set.

In this study, we build an automatic quality assessment framework that is capable of simultaneous pixel-wise and per-case evaluation for segmentation masks. Specifically, we first formally define the pixel-wise error map prediction problem, and then show the capacity of a modern deep learning (DL) model in predicting error maps for auto-generated segmentation masks. We also derive a quality indicator (QI) from the output error maps, which can measure segmentation quality in a per-case manner. To generate diverse and representative segmentation masks, we train a VoxResNet [1] and its 2D version on the training sets, and collect all their side outputs (induced by the deep supervision [3] paths) and final outputs as sample segmentations. To demonstrate the efficacy of our method, we evaluate it on a public 3D whole-heart segmentation dataset, i.e., MICCAI 2017 MMWHS. The qualitative and quantitative results of a 5-fold cross validation show that our framework is able to identify the misclassified pixels in an input mask with satisfactory accuracy. We also observe a strong correlation between QI and the actual segmentation accuracy (Acc), as well as between QI and the DSC score, evidencing the capacity of our framework to work as an image-level segmentation quality evaluator. To the best of our knowledge, this is the first time that the segmentation quality assessment problem is addressed in a pixel-wise manner for medical images, and also the first time that per-case segmentation quality is predicted accurately based only on relatively small training sets (e.g., 16 training MRI scans in each fold).

2 Method

Our method is mainly composed of a mask generation part (segmentor) and an error map prediction part (error map predictor), as shown in Fig. 1(a). In this section, we will first define the error map prediction problem, then elaborate the mask generation and error map prediction methods, and finally detail the training of the proposed framework.

2.1 Formulation of Error Map Prediction Problem

We define the error map $\mathcal {E}$ of a segmentation mask S as

$$\begin{aligned} \mathcal {E}(i) = {\left\{ \begin{array}{ll}1, &{}S(i)\ne GT(i), \\ 0, &{}S(i)=GT(i),\end{array}\right. } \end{aligned}$$

(1)

where GT is the ground truth segmentation and i specifies the pixel (voxel) location. $S(i), GT(i)\in \{0,1,\cdots ,C\}$, where C is the number of foreground classes and 0 denotes the background class. When ground truth segmentation is not available, we build a model M that is parameterized by $\theta $ to estimate the error map:

$$\begin{aligned} \widehat{\mathcal {E}} = M(I,S;\theta ), \end{aligned}$$

(2)

where I denotes the original image, $\widehat{\mathcal {E}}$ is the predicted error map for the segmentation mask S. Thus, given a dataset $\mathcal {D}=\{I_i,\{S_i^k\}_{k=1}^{m},GT_i\}_{i=1}^N$, the error map prediction problem can be formulated as an optimization task over model parameter $\theta $:

$$\begin{aligned} \min _\theta \frac{1}{mN}\sum _{i=1}^{N}\sum _{k=1}^{m}d\left( \mathcal {E}(S_i^k,GT_i), \widehat{\mathcal {E}}(I_i,S_i^k;\theta )\right) , \end{aligned}$$

(3)

where N is the number of images, m is the number of generated segmentation masks for each image, $d(\cdot ,\cdot )$ is a distance metric such as cross entropy, which measures the difference between the true and the predicted error maps. An example error map is shown in Fig. 2(a).
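As a minimal illustration of Eq. (1), the sketch below computes a binary error map from a label mask and its ground truth with NumPy. The function name `error_map` and the toy arrays are illustrative assumptions, not taken from the authors' implementation. The example also shows that, under this definition, the mean of the error map equals the fraction of misclassified pixels, i.e., one minus the pixel accuracy.

```python
import numpy as np

def error_map(seg, gt):
    """Binary error map of Eq. (1): 1 where the mask disagrees with the ground truth."""
    seg = np.asarray(seg)
    gt = np.asarray(gt)
    if seg.shape != gt.shape:
        raise ValueError("mask and ground truth must have the same shape")
    return (seg != gt).astype(np.uint8)

# Toy 2x3 example with C = 2 foreground classes (labels 0, 1, 2); values are made up.
gt = np.array([[0, 1, 1],
               [2, 2, 0]])
seg = np.array([[0, 1, 2],
                [2, 0, 0]])

err = error_map(seg, gt)
error_rate = float(err.mean())        # fraction of misclassified pixels
accuracy = float((seg == gt).mean())  # pixel accuracy of the mask
print(err)
print(error_rate, accuracy)
```

The same function applies unchanged to 3D volumes, since the comparison is element-wise.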

2.2 Mask Generation

To enable the training and evaluation of the error map predictor, we train two different CNN models on the training sets and collect all their outputs to form a mask set $\{S_i^k\}_{i,k}$. One segmentation model is VoxResNet [1], a representative and state-of-the-art CNN model designed for volumetric medical image segmentation tasks, which leverages residual connections [2] and combines multi-scale features to make a quality prediction, as schematically illustrated in Fig. 1(b). To encourage the diversity of collected masks, we also employ a 2D version of VoxResNet to perform the generation work, as the outputs of 2D and 3D models are expected to differ significantly because their receptive fields are different. Further, the side outputs of these models, which are generated by the deep supervision pathways (corresponding to Loss-2 to Loss-5 in Fig. 1(b)), are also added to the mask collection to include more examples with various segmentation qualities. Overall, the number of auto-generated masks for each scan is $m= 2\times (4 + 1) = 10$, where 4 is the number of side outputs in a segmentor. The DSC score histogram of the generated segmentations is shown in Fig. 2(b).
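The mask count can be checked with a few lines of Python. The 3D tags follow the naming used later in the results ('3D-Final', '3D-2' to '3D-5'); the 2D tags are an assumed analogue, and the final line reproduces the 220 masks used in the evaluation under the assumption that each of the 20 scans contributes its 10 generated masks plus its ground truth mask.

```python
MODELS = ("3D", "2D")                # the 3D VoxResNet and its 2D version
SIDE_OUTPUTS = ("2", "3", "4", "5")  # deep supervision paths Loss-2 to Loss-5

def mask_tags():
    """List one tag per generated mask: four side outputs and one final output per model."""
    tags = []
    for model in MODELS:
        tags.extend(f"{model}-{side}" for side in SIDE_OUTPUTS)
        tags.append(f"{model}-Final")
    return tags

tags = mask_tags()
m = len(tags)                        # generated masks per scan
num_scans = 20                       # MRI scans used in the experiments
total_with_gt = num_scans * (m + 1)  # generated masks plus one GT mask per scan
print(tags)
print(m, total_with_gt)
```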

2.3 Error Map Predictor and Quality Indicator

Since an error map has the same size as the corresponding segmentation mask and takes 0–1 values, the error map predictor can be implemented by a binary segmentation model. Without loss of generality, we employ another VoxResNet to carry out the prediction, which has the same architecture as the 3D segmentor but different input and output channels. As shown in Fig. 1(a), the error map predictor takes the concatenation of the segmentation mask (with one-hot coding) and the original image as input, and outputs a probabilistic map indicating the probability of each pixel being misclassified by the segmentor (a binary error map is then obtained by thresholding). Since the mean value of a true error map, as defined in Eq. (1), is exactly the fraction of misclassified pixels, i.e., one minus the segmentation accuracy, we derive a quality indicator (QI) from the average of the predicted binary error map to measure the overall quality of the input mask (denoted by the QI node following the predicted error map in Fig. 1(a)). The training procedure for the predictor is similar to that of a standard segmentation model, with the difference that we generate its input masks on the fly by the segmentors (with fixed parameters) to save RAM space.
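The input construction and the QI computation described above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the channel-first layout, and the threshold of 0.5 are all hypothetical. Because a binary error map following Eq. (1) averages to the fraction of misclassified voxels, the sketch returns both that fraction and its complement, the latter being the quantity directly comparable to Acc.

```python
import numpy as np

def one_hot(seg, num_classes):
    """Encode an integer label map of shape (D, H, W) as (num_classes, D, H, W)."""
    seg = np.asarray(seg)
    classes = np.arange(num_classes).reshape((-1,) + (1,) * seg.ndim)
    return (classes == seg[None]).astype(np.float32)

def predictor_input(image, seg, num_classes):
    """Concatenate the one-hot mask and the image along a leading channel axis."""
    image = np.asarray(image, dtype=np.float32)
    return np.concatenate([one_hot(seg, num_classes), image[None]], axis=0)

def quality_indicator(error_prob, threshold=0.5):
    """Threshold a predicted error probability map and average the binary result.

    Returns the binary map, the predicted error fraction, and its complement.
    The threshold value is an assumption; the text only states that thresholding is used.
    """
    binary = np.asarray(error_prob) >= threshold
    error_fraction = float(binary.mean())
    return binary.astype(np.uint8), error_fraction, 1.0 - error_fraction

# Toy volume: 7 foreground classes plus background gives 8 mask channels and 1 image channel.
image = np.zeros((2, 3, 4), dtype=np.float32)
seg = np.zeros((2, 3, 4), dtype=np.int64)
seg[0, 0, 0] = 7
x = predictor_input(image, seg, num_classes=8)
print(x.shape)

binary, error_fraction, estimated_acc = quality_indicator(np.array([0.9, 0.1, 0.2, 0.6]))
print(binary, error_fraction, estimated_acc)
```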

2.4 Training Details

For the training of the segmentors, we optimize the summation of a cross entropy (CE) loss and a multi-class Dice loss on the training set with a standard segmentation pipeline. Then we apply the trained segmentors to both training and test sets, collect their outputs on the training set for the subsequent training of the predictor, and save the outputs on the test set for evaluation of the error map predictor. In this way, both the segmentors and the predictor can only access the training set during the learning phase, and their performance is evaluated solely on the test set. Besides, since the predictor is essentially a binary segmentation model, we optimize it by the summation of a CE loss and a binary Dice loss (error pixels as the positive class).
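For the predictor, the combined objective can be written down in a few lines. The NumPy sketch below evaluates the sum of a binary cross entropy term and a soft Dice term on a predicted probability map, with error pixels as the positive class. The soft (probability-based) Dice formulation, the smoothing constants, and the equal weighting of the two terms are assumptions, since the text does not specify them; a real training loop would use a differentiable framework instead of NumPy.

```python
import numpy as np

def cross_entropy_loss(prob, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and a 0/1 target."""
    p = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target, dtype=np.float64)
    return float(np.mean(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))))

def soft_dice_loss(prob, target, eps=1e-6):
    """One minus the soft Dice overlap; error pixels (target == 1) are the positive class."""
    p = np.asarray(prob, dtype=np.float64)
    t = np.asarray(target, dtype=np.float64)
    intersection = np.sum(p * t)
    denominator = np.sum(p) + np.sum(t)
    return float(1.0 - (2.0 * intersection + eps) / (denominator + eps))

def predictor_loss(prob, target):
    """Sum of the CE term and the binary Dice term (equal weights assumed)."""
    return cross_entropy_loss(prob, target) + soft_dice_loss(prob, target)

# Toy target error map and two candidate predictions; values are made up.
target = np.array([1.0, 0.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.2, 0.8])
bad = np.array([0.2, 0.7, 0.6, 0.1])
print(predictor_loss(good, target), predictor_loss(bad, target))
```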

3 Experiments and Results

Dataset. We evaluated our framework on a public whole-heart segmentation dataset, i.e., MICCAI 2017 MMWHS [9] (see Note 1). We employed the 20 MRI scans in this dataset, which are paired with manual annotations of 7 foreground classes. For preprocessing, we resampled each scan to an isotropic voxel resolution of $2\times 2\times 2$ mm, normalized its intensity to $[-1,1]$ and then extracted the heart region with a method similar to [4]. The preprocessed scans have a size of around $120\times 120\times 90$. Standard data augmentation methods are applied during the training of the segmentors and the predictor, including random cropping (to a patch size of $96\times 96\times 80$, or $96\times 96$ for 2D), flipping along each axis and scaling. For all experiments, we run 5-fold cross validation since the dataset is relatively small.
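Part of this pipeline can be sketched with NumPy alone. The example below covers intensity normalization to $[-1,1]$, random cropping, and random flipping along each axis; resampling, heart-region extraction, and scaling are omitted. Min-max normalization and a flip probability of 0.5 are assumptions, as the text does not state how these steps were implemented, and the random volume merely stands in for a preprocessed scan.

```python
import numpy as np

def normalize_intensity(volume):
    """Linearly rescale intensities to [-1, 1] (min-max scaling, an assumed choice)."""
    v = np.asarray(volume, dtype=np.float32)
    lo, hi = float(v.min()), float(v.max())
    if hi == lo:
        return np.zeros_like(v)
    return 2.0 * (v - lo) / (hi - lo) - 1.0

def random_crop_and_flip(image, label, patch_size, rng, flip_prob=0.5):
    """Crop the same random window from image and label, then flip both along random axes."""
    if any(p > s for s, p in zip(image.shape, patch_size)):
        raise ValueError("patch size must not exceed the volume size")
    starts = [int(rng.integers(0, s - p + 1)) for s, p in zip(image.shape, patch_size)]
    window = tuple(slice(st, st + p) for st, p in zip(starts, patch_size))
    img, lab = image[window], label[window]
    for axis in range(img.ndim):
        if rng.random() < flip_prob:
            img, lab = np.flip(img, axis), np.flip(lab, axis)
    return img, lab

rng = np.random.default_rng(0)
volume = rng.normal(size=(120, 120, 90)).astype(np.float32)  # stand-in for a scan
labels = rng.integers(0, 8, size=volume.shape)               # background + 7 classes
norm = normalize_intensity(volume)
patch_img, patch_lab = random_crop_and_flip(norm, labels, (96, 96, 80), rng)
print(patch_img.shape, float(norm.min()), float(norm.max()))
```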

Metrics. As the error map prediction problem is essentially a segmentation task, we employ common segmentation metrics to measure its performance, which include Dice similarity coefficient (DSC), accuracy (Acc), precision (Prec) and recall (Recl). Note that in the metric calculation we regard the error pixels as the positive class, though in the figures we show error pixels by low intensity for the sake of clear visualization.
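For concreteness, the four metrics can be computed from the confusion counts of a predicted and a true binary error map, with error pixels as the positive class. The sketch below is illustrative only; returning 0 when a denominator is zero is an assumed convention. The second example reproduces the situation of a ground truth mask, whose true error map is all zero, so that a single falsely flagged pixel yields a DSC of zero while the accuracy stays high.

```python
import numpy as np

def error_map_metrics(pred, true):
    """DSC, Acc, Prec and Recl with error pixels (value 1) as the positive class."""
    pred = np.asarray(pred).astype(bool)
    true = np.asarray(true).astype(bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    fn = int(np.sum(~pred & true))
    tn = int(np.sum(~pred & ~true))

    def ratio(num, den):
        # Assumed convention: a metric with a zero denominator is reported as 0.
        return num / den if den > 0 else 0.0

    return {
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "Acc": ratio(tp + tn, tp + fp + fn + tn),
        "Prec": ratio(tp, tp + fp),
        "Recl": ratio(tp, tp + fn),
    }

# Toy maps; values are made up.
true_map = np.array([1, 1, 0, 0, 0, 0])
pred_map = np.array([1, 0, 1, 0, 0, 0])
metrics = error_map_metrics(pred_map, true_map)

# A GT mask has an all-zero true error map; one false alarm drives DSC to zero.
gt_case = error_map_metrics(np.array([0, 0, 1, 0, 0, 0]), np.zeros(6, dtype=int))
print(metrics)
print(gt_case)
```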

Table 1. Error map prediction performance

Error Map Predictor. To thoroughly investigate the performance of the error map predictor, we evaluate it on different kinds of segmentation masks and report the results in Table 1. The predictor’s performance on the final outputs of the 3D VoxResNet is tagged by ‘3D-Final’, and its side outputs are tagged by ‘3D-2’ to ‘3D-5’, respectively. Ground truth masks are referred to as ‘GT’. We omit the detailed mask categories of the 2D generators as they are similar to the 3D case, and only report their average values to save space. The last two rows show the predictor’s overall performance on all masks with or without GT, respectively. We also list segmentation metrics in the table (last two columns) to aid interpretation of the prediction results. Considering both prediction and segmentation metrics, we find that the error map predictor performs worse on masks of better quality, and vice versa. This is because, for masks of good quality, the error regions are usually small (or thin) and close to the class boundaries (as in the first case shown in Fig. 3), where it is hard to tell which pixel is wrong if GT is not available. On the other hand, low-quality masks tend to have larger error regions that do not concentrate on subtle boundaries, which can be easily identified by our error prediction model (an example is illustrated in the second row of Fig. 3). Another observation is that our error map predictor performs well for GT masks, on which it achieves the highest prediction accuracy, i.e., 0.9967, meaning that our model can ‘feel’ that a GT mask is of high quality, even though it has never seen this GT mask before. Note that the prediction DSC for GT masks is low because the true error map for a GT mask is all zero, such that even a single wrong pixel will lead to a zero DSC score.
Overall, our error map predictor achieves a good DSC of 0.626 (0.592 if considering GT masks), demonstrating the efficacy of the proposed error map prediction model. Qualitative results can be found in Fig. 3, where several representative error map predictions are presented.

Quality Indicator. As mentioned in Sect. 2.3, we derive a QI by averaging the predicted error map. To measure how well QI represents the segmentation quality, we compute the Pearson correlation coefficient (PCC) and the mean absolute error (MAE) between QI and the real segmentation accuracy, as well as the PCC between QI and the real DSC, using all 220 masks. The results are as follows:

$$\begin{aligned} PCC_{QI,Acc} = 0.972,\,PCC_{QI, DSC} = 0.856, \, MAE_{QI, Acc} = 0.0048, \end{aligned}$$

(4)

where both correlations are significant with $p<0.0001$. We also draw two scatter plots (QI-Acc and QI-DSC) in Fig. 4 to visualize the relationship between QI and the segmentation measurements. Considering the high PCC and low MAE between QI and Acc, as well as the strong linear relationship observed in Fig. 4(a), QI can be regarded as a precise approximator of the real segmentation accuracy, and thus can serve as a good segmentation quality measurement.
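The two agreement measures in Eq. (4) follow their standard definitions and can be computed as sketched below. The QI and Acc values in the example are made-up toy numbers, not data from the experiments, and the significance test is omitted because it would require an additional statistics library.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equally long 1D sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def mean_absolute_error(x, y):
    """Mean absolute difference between two equally long 1D sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.mean(np.abs(x - y)))

# Hypothetical per-mask values, for illustration only.
qi = np.array([0.95, 0.90, 0.80, 0.99])
acc = np.array([0.96, 0.88, 0.81, 0.98])
pcc_value = pearson_cc(qi, acc)
mae_value = mean_absolute_error(qi, acc)
print(pcc_value, mae_value)
```

A high PCC alone only indicates a linear relationship; the MAE additionally checks that QI and Acc agree in absolute value, which is why both are reported.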

4 Conclusion

Per-case and fine-grain segmentation quality assessment plays a crucial role in automating image-based pipelines in medical research or clinical diagnosis, but this area has not yet been fully studied. This work formally defines a fine-grain error map prediction problem, and attempts to address it using a DL framework. The evaluation results on a public whole-heart segmentation dataset demonstrate the efficacy of our error map predictor, and also show that a per-case quality measurement can be derived from the predicted error maps, which approximates the real segmentation accuracy well with a small MAE. Future work will investigate the potential of error map predictors in improving a segmentor’s robustness, where an error map predictor can serve as a critic to regularize the segmentation model. The proposed framework is inherently generic: the segmentors and predictors can be replaced with different types of models (e.g., random forest), and it can be adapted to other segmentation applications easily.

Notes

1. http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs17/index.html

References

1. Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage 170, 446–455 (2018)
2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
3. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: Artificial Intelligence and Statistics, pp. 562–570 (2015)
4. Payer, C., Štern, D., Bischof, H., Urschler, M.: Multi-label whole heart segmentation using CNNs and anatomical label configurations. In: Pop, M., et al. (eds.) STACOM 2017. LNCS, vol. 10663, pp. 190–198. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75541-0_20
5. Robinson, R., et al.: Automatic quality control of cardiac MRI segmentation in large-scale population imaging. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 720–727. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_82
6. Valindria, V.V., et al.: Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans. Med. Imaging 36(8), 1597–1606 (2017)
7. Zhang, H., Fritts, J.E., Goldman, S.A.: Image segmentation evaluation: a survey of unsupervised methods. Comput. Vis. Image Underst. 110(2), 260–280 (2008)
8. Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross validation framework to choose amongst models and datasets for transfer learning. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010. LNCS (LNAI), vol. 6323, pp. 547–562. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_35
9. Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Med. Image Anal. 31, 77–87 (2016)

Author information

Authors and Affiliations

The Hong Kong University of Science and Technology, Hong Kong, China
Rongzhao Zhang & Albert C. S. Chung

Corresponding author

Correspondence to Rongzhao Zhang.

Editor information

Editors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Dinggang Shen
University of Georgia, Athens, GA, USA
Tianming Liu
Western University, London, ON, Canada
Terry M. Peters
Yale University, New Haven, CT, USA
Lawrence H. Staib
University of Strasbourg, Illkirch, France
Caroline Essert
United Imaging Intelligence, Shanghai, China
Sean Zhou
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Pew-Thian Yap
Western University, London, ON, Canada
Ali Khan

About this paper

Cite this paper

Zhang, R., Chung, A.C.S. (2019). A Fine-Grain Error Map Prediction and Segmentation Quality Assessment Framework for Whole-Heart Segmentation. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science, vol 11765. Springer, Cham. https://doi.org/10.1007/978-3-030-32245-8_61

DOI: https://doi.org/10.1007/978-3-030-32245-8_61
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32244-1
Online ISBN: 978-3-030-32245-8
eBook Packages: Computer Science, Computer Science (R0)
