
Medical Image Analysis

Volume 75, January 2022, 102261

DGMSNet: Spine segmentation for MR image by a detection-guided mixed-supervised segmentation network

https://doi.org/10.1016/j.media.2021.102261

Highlights

  • We present a detection-guided mixed-supervised segmentation network (DGMSNet) to achieve spine segmentation for MR images. The generalization of the segmentation path in DGMSNet is improved under the guidance of the detection path trained with the weakly-supervised dataset.

  • We introduce a detection-guided learner that produces semantic features for spine segmentation, which mitigates inter-class similarity and improves segmentation performance.

  • We propose a detection-guided label fusion approach to obtain the final segmentation prediction for the inference phase, which improves the robustness of the method.

Abstract

Spine segmentation for magnetic resonance (MR) images is important for the diagnosis and treatment of various spinal diseases, yet remains challenging due to inter-class similarity, i.e., the shape and appearance similarities among neighboring spinal structures. To reduce inter-class similarity, existing approaches focus on enhancing the semantic information of spinal structures in a supervised segmentation network, whose generalization is limited by the size of the pixel-level annotated dataset. In this paper, we propose a novel detection-guided mixed-supervised segmentation network (DGMSNet) for automated spine segmentation. DGMSNet consists of a segmentation path, which generates the spine segmentation prediction, and a detection path (i.e., a regression network), which produces heatmap predictions of keypoints. A detection-guided learner in the detection path generates a dynamic parameter, which is employed to produce a semantic feature map for the segmentation path via adaptive convolution. A mixed-supervised loss, a weighted combination of a segmentation loss and a detection loss, is used to train DGMSNet on a pixel-level annotated dataset and a keypoint-detection annotated dataset. During training, a series of models is trained with various loss weights. During inference, a detection-guided label fusion approach integrates the segmentation predictions generated by these trained models according to the consistency of the predictions from the segmentation and detection paths. Experiments on T2-weighted MR images show that DGMSNet achieves state-of-the-art performance, with mean Dice similarity coefficients of 94.39% and 87.21% for the segmentation of 5 vertebral bodies and 5 intervertebral discs on the in-house and public datasets, respectively.

Introduction

Spine segmentation (i.e., multi-class segmentation of the vertebral bodies (VBs) and intervertebral discs (IVDs) in a spine image) for magnetic resonance (MR) images plays a significant role in diagnosing spinal diseases, planning surgical treatment, locating spinal pathologies (Chang, Zhao, Zheng, Chen, Li, 2020, Pang, Pang, Zhao, Chen, Su, Zhou, Huang, Yang, Lu, Feng, 2021), and estimating spinal indices (Pang, Leung, Nachum, Feng, Li, 2018, Pang, Su, Leung, Nachum, Chen, Feng, Li, 2019, Lin, Tao, Pang, Su, Lu, Li, Feng, Chen, 2020, Lin, Tao, Yang, Pang, Su, Lu, Li, Feng, Chen, 2021). Specifically, spine segmentation for the 2D mid-sagittal MR image can assist physicians in grading disc herniation (Fardon, 2001, Williams, Murtagh, Rothman, Sze, 2014). Nevertheless, manual spine segmentation is tedious, time-consuming, and subject to expertise-dependent inter- and intra-observer variability. Automated spine segmentation offers the potential to circumvent these issues.

Inter-class similarity, i.e., the shape and appearance similarities among the neighboring vertebrae (or intervertebral discs) of a subject, is an intractable challenge in spine segmentation for MR images (Pang et al., 2021). To reduce the inter-class similarity of spinal MR images, enhancing the semantic information of images in a segmentation network is a feasible solution. Han et al. (2018) introduced long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) into a segmentation network named Spine-GAN to generate semantic image representations for semantic segmentation of multiple spinal structures by exploiting the long-range spatial correlation of the pixels in the feature maps. Moreover, some researchers (Chang, Zhao, Zheng, Chen, Li, 2020, Pang, Pang, Zhao, Chen, Su, Zhou, Huang, Yang, Lu, Feng, 2021) employed the graph convolutional network (GCN) (Kipf and Welling, 2016) to generate semantic image representations for spine segmentation by capturing the spatial correlations between spinal structures. Ordinarily, pixel-level annotation of spine images is costly, whereas keypoint-detection annotation is cheap. The aforementioned spine segmentation approaches rely on supervised learning, which limits their generalization when the pixel-level annotated dataset is inadequate.

Given a pixel-level annotated dataset, referred to as the strongly-supervised dataset (see Fig. 1(a)), and a keypoint-detection annotated dataset, referred to as the weakly-supervised dataset (see Fig. 1(b)), how to improve the generalization of a spine segmentation model by extracting additional semantic information from the weakly-supervised dataset is the question studied in this paper.

Work related to this study includes mixed-supervised segmentation, multi-task learning, and loss-weight learning for multi-task learning.

Mixed-supervised segmentation achieves segmentation by combining a subset of fully annotated images with weakly annotated images. These weak annotations usually take the form of bounding boxes (Wang, Li, Ben-Shlomo, Corrales, Cheng, Zhang, Jayender, 2019, Wang, Li, Ben-Shlomo, Corrales, Cheng, Zhang, Jayender, 2021, Shah, Merchant, Awate, 2018), image-level annotations (Hong, Noh, Han, 2015, Mlynarski, Delingette, Criminisi, Ayache, 2019), pseudo masks (Luo and Yang, 2020), and boundary landmarks (Shah et al., 2018). Previous research on mixed-supervised segmentation addressed the problem from a multi-task objective perspective. Specifically, Wang, Li, Ben-Shlomo, Corrales, Cheng, Zhang, Jayender, 2019, Wang, Li, Ben-Shlomo, Corrales, Cheng, Zhang, Jayender, 2021 proposed a Mixed-Supervised Dual-Network (MSDN), which consisted of two separate networks for the segmentation and bounding-box detection tasks, respectively, and a series of connection modules between the layers of the two networks. The performance of their segmentation network was improved by the bounding-box detection network. Luo and Yang (2020) achieved mixed-supervised semantic segmentation via a strong-weak dual-branch network (SWDN), which consisted of a strong branch and a weak branch. They utilized Deep Seeded Region Growing (DSRG) (Huang et al., 2018) to generate pseudo masks for the image-level annotated images, which were used to train the weak branch. The strong branch was regularized by the weak branch, which improved the performance of semantic segmentation.

The abovementioned approaches suffer from two limitations: 1) they did not employ a keypoint-detection annotated dataset, which is cheap to annotate and has the potential to assist spine segmentation since it contains sufficient semantic information; 2) the auxiliary task trained on the weakly-supervised dataset explicitly affects the main task (i.e., segmentation) only in the feature space, not in the prediction space. In contrast, our work focuses on exploiting the keypoint-detection annotated dataset to facilitate spine segmentation, and the auxiliary task (i.e., keypoint detection) in our approach explicitly guides the main task in both the feature space and the prediction space. Moreover, no mixed-supervised segmentation network has previously been studied for spine segmentation in MR images.

Multi-task learning aims to improve network performance by learning multiple tasks simultaneously. Zhang et al. (2020) presented a multi-task relational learning network (MRLN) for vertebrae detection (i.e., localization and identification) and segmentation. They introduced a co-attention module in the forward propagation procedure to learn the correlation information between the two tasks, which alleviated the overfitting of a single task. Nie et al. (2018) proposed a novel parsing induced learner (PIL) that exploits human parsing information to assist keypoint detection through adaptive convolution, whose dynamic parameters were generated by the PIL. Inspired by PIL but differing from it, the proposed approach exploits the keypoint detection task to assist the parsing (i.e., segmentation) task.

The loss function in multi-task learning is usually a weighted combination of the losses of the individual tasks, and setting the loss weights is a challenge. Handcrafted task weighting (e.g., grid search) is a simple approach, but its computational cost grows exponentially with the number of tasks. Consequently, a growing body of research addresses automated loss-weight learning for multi-task learning. Existing approaches fall into three categories: gradient-based weight learning (Chen, Badrinarayanan, Lee, Rabinovich, 2018, Jha, Kumar, Banerjee, Chaudhuri, 2020), uncertainty-based weight learning (Kendall et al., 2018), and loss-value-based weight learning (Liu et al., 2019). In all of these approaches, a single fixed trained model is used for testing; in other words, all test images share the same loss-weight learning procedure, which limits the generalization of the model. Moreover, these loss-weight learning approaches introduce additional hyper-parameters, which is not cost-effective when there is only one loss weight to be learned.
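In contrast to automated weight learning, sweeping a single loss weight is simple to express. The sketch below is a minimal Python illustration with hypothetical loss values; `mixed_supervised_loss` and the weight grid are assumptions made for illustration, not the paper's actual configuration:

```python
def mixed_supervised_loss(seg_loss, det_loss, lam):
    """Weighted combination of the two task losses.

    lam is the single loss weight trading off segmentation
    supervision against detection supervision.
    """
    return seg_loss + lam * det_loss

# One model is trained (and kept) per candidate loss weight;
# the loss values 0.8 and 0.3 are hypothetical placeholders.
weights = [0.1, 0.5, 1.0, 2.0]
losses = [mixed_supervised_loss(0.8, 0.3, lam) for lam in weights]
```

Because only one weight is swept, a plain grid stays cheap; the exponential blow-up noted above appears only when several weights must be searched jointly.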

In this study, a detection-guided mixed-supervised segmentation network (DGMSNet), shown in Fig. 2, is proposed to achieve spine segmentation for MR images. DGMSNet comprises a segmentation path, which generates the spine segmentation prediction, and a detection path (i.e., a regression network), which produces heatmap predictions of the keypoints. The detection-guided learner (DGL) in the detection path generates a dynamic parameter, which is used as the convolution kernel of an adaptive convolution to extract semantic information for the segmentation path. The two paths are trained simultaneously end-to-end. Specifically, the segmentation path is trained on the strongly-supervised dataset, while the detection path is trained on both the strongly-supervised and weakly-supervised datasets. Note that the keypoint coordinates in the strongly-supervised dataset are derived from the masks of the spinal structures. The loss function is a weighted combination of the losses of the segmentation and detection tasks. During the training phase, a set of models is trained and saved with various loss-weight values. During the inference phase, these trained models generate a set of segmentation predictions and heatmap predictions via the segmentation and detection paths, respectively. Based on the assumption that the detection path outperforms the segmentation path, the final segmentation prediction is obtained by detection-guided label fusion according to the consistency between the heatmaps predicted by the detection path and those generated from the segmentation prediction.
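The adaptive convolution at the heart of the DGL can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the learner is reduced to global average pooling followed by a fixed linear map, and all function names and shapes are hypothetical:

```python
import numpy as np

def detection_guided_kernel(det_feat, k=3, c_out=4, rng=None):
    # Hypothetical DGL: pool the detection feature map, then map the
    # pooled vector to a conv kernel (the "dynamic parameter").
    rng = np.random.default_rng(0) if rng is None else rng
    c_in = det_feat.shape[0]
    pooled = det_feat.mean(axis=(1, 2))                     # (c_in,)
    W = rng.standard_normal((c_out * c_in * k * k, c_in)) * 0.01
    return (W @ pooled).reshape(c_out, c_in, k, k)

def adaptive_conv(seg_feat, kernel):
    # 'Same'-padded convolution of the segmentation feature map with
    # the sample-specific kernel produced by the learner.
    c_out, c_in, k, _ = kernel.shape
    h, w = seg_feat.shape[1:]
    pad = k // 2
    x = np.pad(seg_feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += kernel[o, i, di, dj] * x[i, di:di + h, dj:dj + w]
    return out
```

The key difference from an ordinary convolution is that the kernel changes per input image, so the semantic guidance the segmentation path receives is sample-specific.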

The main contributions of this paper are listed as follows:

  • We present a detection-guided mixed-supervised segmentation network (DGMSNet) to achieve spine segmentation for MR images. The generalization of the segmentation path in DGMSNet is improved under the guidance of the detection path in feature space with the weakly-supervised dataset.

  • We introduce a detection-guided learner (DGL) that produces semantic features for spine segmentation, which mitigates inter-class similarity and improves segmentation performance.

  • We propose a detection-guided label fusion (DGLF) approach that obtains the final segmentation prediction in the inference phase by choosing between majority voting and adaptive model selection according to the sensitivity of segmentation performance to the loss weight. The segmentation path is guided by the detection path in prediction space, which improves the generalization and robustness of the method.
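The two fusion modes named in the last contribution can be sketched as below. This is a toy Python illustration, not the authors' exact heatmap-consistency criterion: consistency is simplified to the distance between detection keypoints and the centroids of the segmented structures, and every name and shape here is hypothetical:

```python
import numpy as np

def centroids_from_seg(seg, n_classes):
    # Keypoints implied by a segmentation: centroid of each structure's mask.
    pts = []
    for c in range(1, n_classes + 1):
        ys, xs = np.nonzero(seg == c)
        pts.append((ys.mean(), xs.mean()) if len(ys) else (np.nan, np.nan))
    return np.array(pts)

def consistency(seg, det_pts, n_classes):
    # Mean distance between segmentation-derived and detected keypoints
    # (lower = more consistent; missing structures get a large penalty).
    d = np.linalg.norm(centroids_from_seg(seg, n_classes) - det_pts, axis=1)
    return np.nan_to_num(d, nan=1e6).mean()

def dglf(segs, det_pts_list, n_classes, sensitive=True):
    if sensitive:
        # Adaptive model selection: keep the most consistent prediction.
        scores = [consistency(s, p, n_classes)
                  for s, p in zip(segs, det_pts_list)]
        return segs[int(np.argmin(scores))]
    # Otherwise: pixel-wise majority voting across all models' predictions.
    stack = np.stack(segs)
    counts = np.apply_along_axis(np.bincount, 0, stack,
                                 minlength=n_classes + 1)
    return counts.argmax(axis=0)
```

Selection uses the detection path as a referee over the per-weight models, while voting simply pools them; which branch runs depends on how sensitive segmentation quality is to the loss weight.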

Section snippets

DGMSNet

The proposed DGMSNet, shown in Fig. 2, consists of a detection path gφ parameterized by φ and a segmentation path f[θ,ϕ] parameterized by θ and ϕ. Note that we omit the network parameters hereinafter and write g and f for gφ and f[θ,ϕ], respectively, to simplify notation. The detection path extracts semantic features, which guide the segmentation path to generate accurate segmentation results. Both paths utilize an encoder-decoder architecture. Moreover, the detection

Datasets

Two datasets denoted as Dataset-A and Dataset-B respectively were used to evaluate the segmentation performance of the proposed approach.

Overall performance

As shown in Fig. 5 and Fig. 6, the proposed DGMSNet achieves accurate spine segmentation for MR images. Specifically, DGMSNet achieves mean Precisions of 95.87±4.85%, 92.44±4.57%, and 94.16±3.88% for the segmentation of VBs, IVDs, and all 10 spinal structures, respectively. The corresponding mean Recalls are 95.83±6.03%, 94.03±4.84%, and 94.93±5.09%, respectively. The Precision of IVD segmentation is significantly lower than the corresponding Recall, which demonstrates that the false-positive rate of

Conclusion

We have presented an accurate and robust detection-guided mixed-supervised segmentation network (DGMSNet) to achieve spine segmentation for MR images. In the training phase, the proposed DGL learned the semantic information of spinal structures from the weakly-supervised dataset by the mixed-supervised learning strategy, which guided the segmentation path in feature space to generate accurate segmentation prediction. In the inference phase, based on Assumption 1, the DGLF was presented to

CRediT authorship contribution statement

Shumao Pang: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. Chunlan Pang: Investigation, Software, Writing – original draft. Zhihai Su: Resources, Data curation. Liyan Lin: Visualization. Lei Zhao: Visualization. Yangfan Chen: Visualization. Yujia Zhou: Writing – review & editing. Hai Lu: Resources, Data curation. Qianjin Feng: Supervision, Project administration, Funding acquisition, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Financial support for this work was provided by the National Natural Science Foundation of China (No. 62001207, 81974275), the China Postdoctoral Science Foundation (No. 2020M672712), the Zhuhai City Innovation and Innovation Team Project, Guangdong Province, China (No. ZH0406190031PWC), and the Guangdong Provincial Key Laboratory of Medical Image Processing (No. 2020B1212060039). No other potential conflict of interest relevant to this article was reported.

References (34)

  • S. Hochreiter et al. Long short-term memory. Neural Comput. (1997)

  • S. Hong et al. Decoupled deep neural network for semi-supervised semantic segmentation. arXiv preprint arXiv:1506.04924 (2015)

  • Z. Huang et al. Weakly-supervised semantic segmentation network with deep seeded region growing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  • S. Ioffe et al. Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (2015)

  • A. Jha et al. AdaMT-Net: an adaptive weight learning based multi-task learning model for scene understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)

  • A. Kendall et al. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  • D.P. Kingma et al. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)