Abstract
This paper addresses the problem of segmentation of the hip joint including both the acetabulum and the proximal femur in three-dimensional magnetic resonance images. We propose a fully convolutional volumetric auto encoder that learns a volumetric representation from manual segmentation in order to regularize the segmentation results obtained from a fully convolutional network. We further introduce a super resolution network to improve the segmentation accuracy. Comprehensive results obtained from 24 patient data demonstrated the effectiveness of the proposed framework.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Femoroacetabular impingement (FAI) is a cause of hip pain in adults and has been recognized recently as one of the key risk factors that may lead to the development of early cartilage and labral damage [1] and a possible precursor of hip osteoarthritis [2]. Several studies [2, 3] have shown that the prevalence of FAI in young populations with hip complaints is high. Although there exist a number of imaging modalities that can be used to diagnose and assess FAI, magnetic resonance (MR) imaging does not induce any dosage of radiation at all and is regarded as the standard tool for FAI diagnosis [4]. While manual analysis of a series of two-dimensional (2D) MR images is feasible, automated segmentation of the proximal femur in MR images will greatly facilitate the applications of MR images for FAI surgical planning and simulation.
The topic of automated MR image segmentation of the hip joint has been addressed by a few studies which relied on atlas-based segmentation [5], graph-cut [6], active models [7, 8] or statistical shape models [9]. While these methods reported encouraging results for bone segmentation, further improvements are needed. For example, Arezoomand et al. [8] recently developed a three-dimensional (3D) active model framework for segmentation of the proximal femur in MR images and they reported an average recall of 0.88.
Recently, machine-learning based methods, especially those based on convolutional neural networks (CNNs), have witnessed successful applications in natural image processing [10, 11] as well as in medical image analysis [12,13,14,15]. For example, Prasoon et al. [12] developed a method to use a triplanar CNN that can autonomously learn features from images for knee cartilage segmentation. More recently, 3D volume-to-volume segmentation networks were introduced, including 3D U-Net [13], 3D V-Net [14] and a 3D deeply supervised network [15].
In this paper, we propose a fully convolutional volumetric auto-encoder that learns a volumetric latent representation from manual segmentation in order to regularize the segmentation results obtained from a fully convolutional network (FCN). To derive the volumetric latent representation from such a fully convolutional volumetric auto-encoder and to further use the derived latent representation to regularize segmentation are memory intensive as both steps require to take volumes containing the complete structures as input. To address such a difficulty, we propose to conduct the latent space constrained segmentation in a downsampling space and then introduce a super resolution network to improve the segmentation resolution, leading to improved segmentation accuracy. The details about our method will be presented in Sect. 2. We will then describe the experiments and results in Sect. 3, followed by a discussion in Sect. 4.
2 Method
Figure 1 illustrates the proposed method for hip joint segmentation. It contains three networks, i.e. a FCN-based segmentation network, a denoising autoencoder and a super resolution network. Our method takes a down-sampled low resolution (LR) MR image as input and directly generate high resolution (HR) segmentation. Below we first present an overview of the proposed method, followed by the presentation of the loss functions, network architectures, and the implementation details.
2.1 Overview
The FCN-based segmentation network is to segment the input low resolution MR image \(x_l\) into 3 classes, i.e. acetabulum, femur and background. The denoising autoencoder is to learn a latent representation from manual volumetric segmentation in order to regularize the output from the FCN-based segmentation network.
The low resolution MR image and the segmentation from the denoising autoencoder provide complementary information: the low resolution MR image provides original density information, while the segmentation from the denoising autoencoder provides the anatomically constrained segmentation probabilities. The super resolution network takes both the low resolution MR image and the anatomically constrained segmentation from the denoising autoencoder as input to generate the final segmentation in high resolution.
2.2 Loss Functions
We train our network in three stages: in the first stage, the segmentation network (Snet) and autoencoder (AE) were trained separately. Then the segmentation network was trained jointly with the autoencoder. In the third stage, the super resolution network was trained to upscale the output of the decoder to a high resolution segmentation.
Before we introduce the details of loss functions, we need to do some definitions. The input MR image in high resolution (HR) and low resolution (LR) are defined as \(x_h\) and \(x_l\), respectively. The segmentation ground truth in HR and LR are defined as \(y_h\) and \(y_l\), respectively. The function \(f(\cdot )\) and \(g(\cdot )\) are defined as the encoder and decoder components of the AE, respectively. \(\phi (\cdot )\) and \(\varPhi (\cdot )\) are defined as the segmentation network and super resolution network, respectively.
Loss Function of Snet and AE at Stage One. Snet is to segment the low resolution input MR Images \(x_l\) into several sub-classes, including acetabulum, femur and background. AE is to learn a latent representation so that the segmentation input can be reconstructed. Specifically, the training objective of AE is shown below:
where \(\theta _e\) denotes all trainable parameters of the AE model and \(w_e\) denotes all trainable parameters except bias term in the AE network, \(\lambda _w\) determines the weight of decay terms, and \(L_{mce}\) a multi-class cross entropy loss (mce) defined as:
where N is the number of samples, C is the number of classes, \(\hat{y_{ic}}\) is the predicted probability and \(y_{ic}\) is the corresponding ground truth label.
For the segmentation network trained in the first stage, which aims to minimizes a voxel-wise multi-class cross entropy loss, i.e. \(L_{l}\), which is shown below:
where \(\theta _s\) denotes all trainable parameters of Snet and \(w_s\) denotes all trainable parameters except bias term in Snet.
Loss Function of Snet at Stage Two. In stage two, the AE network was integrated into the segmentation network. The segmentation network was trained not only to minimize the voxel-wise cross-entropy loss, but also to minimize the Euclidean loss \(L_e\) of the latent representation learnt by AE. \(L_e\) is defined as below:
The training of the segmentation network in the second stage is shown as follows:
Loss Function of Snet at Stage Three. In the third stage, the super resolution network was trained to upscale the low resolution segmentation, i.e. the output of decoder in AE, to high resolution segmentation. The loss of the super resolution network is defined as below:
where \(\theta _r\) denotes all trainable parameters of the super resolution network and \(w_r\) denotes all trainable parameters except bias term in the super resolution network.
2.3 Network Architectures
Below we present the network architectures of different sub-networks involved in our method.
Segmentation Network. Figure 2 illustrates the architecture of the segmentation network. It is a variant of the 3D U-net [13].
Denoising Autoencoder. Figure 3 shows the architecture of the fully convolutional denoising auto encoder to learn an end-to-end, voxel-to-voxel mapping. The left half of our network can be seen as an encoder stage that results in a condensed representation (indicated by “latent vector representation”). In the second stage (right half), the network reconstructs back the input from the latent vector representation by deconvolutional (3DDeconv) layers. The network is trained using cross-entropy loss. After training, the encoder \(f(y;{\theta _f})\) can be used to map a noisy volumetric label to a vector representation h in the latent space.
Super Resolution Network. Figure 4 illustrates the architecture of super resolution network. The input of the super resolution network is the concatenation of the low resolution MR image and the segmentation probability map reconstructed from the denoising autoencoder. After one convolutional layer to the input, the feature maps were upscaled by \(2\times 2\times 2\) times through a deconvolutional layer. Followed by another two convolutional layers, the segmentation in high resolution was generated.
All convolutional and deconvolutional layers are followed by a BN layer [16] and a ReLU layer [17], except the last one before the final output layer. All filter size of convolutional and deconvolutional layers are \(3\times 3\times 3\).
2.4 Implementation Details
The proposed method was implemented in Python using TensorFlow framework on a desktop with a 3.6 GHz Intel(R) i7 CPU and a GTX 1080 Ti graphics card with 11 GB GPU memory.
Training. The whole volume of MR images in low resolution are fed into the neural network, and the size is \(120\times 120\times 40\). Each training volumetric image was normalized as zero mean and unit variance. As we train our three networks (segmentation network, autoencoder, and super resolution network) in an end-to-end way, the batch size of 1 was adopted. We trained our networks from scratch, and all weights were initialized from a Gaussian distribution (\(\mu = 0,\sigma = 0.01\)). As mentioned above, the training was done in three stages. The segmentation network and autoencoder were trained separately for 10, 000 iterations, and all weights were updated by the stochastic gradient descent (SGD) algorithm (momentum = 0.9, weight decay = 0.005). After that, we started the second stage training, i.e. the segmentation network and autoencoder were jointly trained for another 10000 iterations. In the final stage, the super resolution network was trained for another 10000 iterations. In total, we trained our neural network for 30000 iterations. For each stage of the training, the initial learning rate was \(1\times 10^{-3}\) and halved by 3000 every training iterations.
Testing. In the inference phase, the high resolution MR image was downscaled to a low resolution MR image in the size of \(120\times 120\times 40\), and the trained model can directly generate the high resolution segmentation results.
3 Experiments and Results
We conduct a series experiments: First we evaluate the reconstruction performance of the trained denoising autoencoder model. Then we compared different methods on the hip joint segmentation. Metrics used to evaluate the performance include the Dice overlap coefficient (Dice) [18], Hausdorff distance (HD) [19], average surface distance (ASD) [20], and precision and recall [21].
3.1 Dataset and Preprocessing
The dataset contains 24 hip joint data of FAI patients. All data are in the format of volumetric MR images and in the size of \(240\times 240\times 80\) (high resolution). To feed the whole volume data into the neural network, all MR images were downscaled to \(120\times 120\times 40\) (low resolution). Further, to enlarge the training samples and mitigate possible over-fitting problem, random noise was injected: random value between (\(-3,3\)) was added to each voxel. Finally, each training sample was normalized as zero mean and unit variance before being fed into the network.
3.2 Denoising Autoencoder Experiments
In this experiment, we conduct experiments to evaluate the performance of the denoising autoencoder in terms of shape completion and false positive exclusion. Standard two-fold cross validation were adopted. The dataset was randomly split into two separate parts, 12 were used for training and 12 were used for testing.
The qualitative results from the denoising autoencoder are shown in Fig. 5. The top half of the figure illustrates the shape completion function of the denoising autoencoder. Specifically, the input labels are randomly dropped by \(50\%\) percent and become incomplete-shaped labels with lots of “holes”. And the denoising autoencoder generates labels without holes but with complete shape. The bottom half of the figure shows the false positive exclusion function of denoising autoencoder. The input labels are added with \(50\%\) random noise and become noisy input with lots of false positive predictions. Again the denoising autoencoder generates labels with complete shape and almost no false positive predictions.
We further explore how much shape corruption and noise that the denoising autoencoder can tolerate. Specifically, we are going to find the minimum percentage of original labels that the model needed and the maximum percentage of noise that the model can tolerate when it can still make accurate reconstruction. Table 1 shows the quantitative results about the relationship between the dropout rate of the input labels and the reconstruction accuracy. We can observe that even with a dropout rate of \(95\%\), i.e. only \(5\%\) of the original labels are kept, the model can still make good reconstruction. But when the dropout rate was increased to \(99\%\), ASD of both acetabulum and femur are larger than 1 mm, which is assumed as a bad reconstruction.
Based on the maximum dropout rate (\(95\%\)), we explore the maximum noise level that the model can tolerate. Table 2 shows the quantitative results about the relationship between the noise level and the reconstruction accuracy. As one can observe from the results, even with \(75\%\) of random noise, the denoising autoencoder can still make good reconstruction. When the noise level is increased to \(100\%\), the average ASD of acetabulum is increased to 1.07 mm and the reconstruction is corrupted.
Based on above analysis, we would say the trained denoising autoencoder model can make good reconstructions with only \(5\%\) of original input labels and have a tolerance of \(75\%\) noise. Figure 6 shows an example of reconstruction with both corruption labels and noise.
The results demonstrate that the model can reconstruct hip shapes based on little image information and with high noise tolerance. This raises the question of whether it simply represents a fixed shape. Figure 7 shows four reconstructions randomly selected from the 12 test cases. Even though the shapes of the acetabulum vary considerably, the model fits each image accurately, demonstrating that it models the full range of acetabular shapes represented in the input data.
3.3 Segmentation Experiments
In this experiment, the proposed method is applied to segment the hip joint from 3D MR images using the dataset described in Sect. 3.1. We conduct a standard two-fold cross validation to evaluate the performance.
The proposed method is compared with: patch-based 3DUnet [13] model in high resolution (HR-3DUnet), whole-volume based 3DUnet model in low resolution (LR-3DUnet) and the cascaded 3DUnet with AE model regularization (AE-Seg) [22]. The experiment results are shown in Table 3.
From the experiments results shown in Table 3, we can observe HR-3DUnet significantly under-performs compared with other methods in terms of Dice (0.842 for acetabulum and 0.873 for femur), HD (65.91 mm for acetabulum and 40.67 mm for femur) and ASD (7.80 mm for acetabulum and 6.49 mm for femur). The poor performance is also shown in the qualitative comparison in Fig. 8. The segmentation results from HR-3DUnet contains many more false positive detections compared with LR-3DUnet, especially in the axial view and sagittal view. This is due to the fact that HR-3DUnet adopts a patch-based strategy, and a patch cannot take advantage of the global context information. But global context information is critically important in the task of hip joint segmentation, because the gray value of acetabulum, femur and background are very similar, and the boundary is also very fuzzy.
LR-3DUnet takes the whole low resolution volume as input, encouraging the segmentation results to make more use of global context. All evaluation metrics shown in Table 3 from LR-3DUnet are better than HR-3DUnet. But LR-3DUnet suffer from the voxel-wise loss, which cannot guarantee the global consistency and meaningful structure. For the segmentation results from LR-3DUnet, the mean ASD of acetabulum and femur are 1.79 mm and 1.60 mm, respectively, which are quite large compared to good segmentations. In addition, there are many false positives predictions, especially in the coronal view, which can be seen in the third line of Fig. 8.
AE-Seg [22] makes use of an implicit way to constrain the segmentation shape and implements it as a regularization term in the training phase. However, it is too weak to force the network to generate segmentations that completely lie in the latent space. In the fourth row of Fig. 8, we can see that AE-Seg can partially remove some false positives in the sagittal view, but the false positive predictions in the axial view and the coronal view still cannot be eliminated. For the acetabulum of a hip joint, its shape is complex and has large differences from each other. This is due to regularization term on the latent representation works in an implicit way and cannot really guarantee segmentation with meaningful shape. That is also why though there is an improvement in Dice for acetabulum, compared with LR-3DUnet, the average HD metric (29.23 mm) from AE-Seg is even worse than LR-3DUnet (24.76 mm).
Our method explicitly constrains the segmentation shape via the denoising autoencoder. The outputs from the denoising autoencoder were shown in the fifth row of Fig. 8. The denoising autoencoder can automatically remove false positives and fill some missing parts by prior knowledge.
The comparison between the segmentation results by our method and the ground truth is shown in last two bottom lines in Fig. 8. This comparison clearly demonstrates the effectiveness of incorporating the denoising autoencoder and super resolution network to generate final segmentation results. The quantitative comparison in Table 3 also demonstrated the efficacy of the proposed method. Compared with HR-3DUnet, our method has an improvement of \(5.7\%\) and \(6.0\%\) over HR-3DUnet in terms of Dice for segmenting the acetabulum and the proximal femur, respectively.
4 Discussion
Incorporating prior knowledge into image segmentation algorithms has proven useful in order to obtain more accurate and plausible results [23]. In this paper, we proposed to use a fully convolutional volumetric auto encoder to learn volumetric representation in order to regularize the segmentation output of a segmentation network. We further introduced a super resolution network to improve the segmentation accuracy. We conducted comprehensive experiments on 3D MR images of 24 patients to validate the efficacy of the proposed method.
References
Laborie, L., Lehmann, T., Engesæter, I., Eastwood, D., Engesæter, L., Rosendahl, K.: Prevalence of radiographic findings thought to be associated with femoroacetabular impingement in a population-based cohort of 2081 healthy young adults. Radiology 260(2), 494–502 (2011). https://doi.org/10.1148/radiol.11102354
Leunig, M., Beaulé, P., Ganz, R.: The concept of femoroacetabular impingement: current status and future perspectives. Clin. Orthop. Relat. Res. 467(3), 616–622 (2009). https://doi.org/10.1007/s11999-008-0646-0
Clohisy, J., Knaus, E., Hunt, D., Lesher, J., Harris-Hayes, M., Prather, H.: Clinical presentation of patients with symptomatic anterior hip impingement. Clin. Orthop. Relat. Res. 467(3), 638–644 (2009). https://doi.org/10.1007/s11999-008-0680-y
Perdikakis, E., Karachalios, T., Katonis, P., Karantanas, A.: Comparison of MR-arthrography and MDCT-arthrography for detection of labral and articular cartilage hip pathology. Skeletal Radiol. 40(11), 1441–1447 (2011). https://doi.org/10.1007/s00256-011-1111-9
Xia, Y., Fripp, J., Chandra, S., Schwarz, R., Engstrom, C., Crozier, S.: Automated bone segmentation from large field of view 3D MR images of the hip joint. Phys. Med. Biol. 58(20), 7375–7390 (2013). https://doi.org/10.1088/0031-9155/58/20/7375
Xia, Y., Chandra, S., Engstrom, C., Strudwick, M., Crozier, S., Fripp, J.: Automatic hip cartilage segmentation from 3D MR images using arc-weighted graph searching. Phys. Med. Biol. 59(23), 7245–7266 (2014). https://doi.org/10.1088/0031-9155/59/23/7245
Gilles, B., Magnenat-Thalmann, N.: Musculoskeletal MRI segmentation using multi-resolution simplex meshes with medial representations. Med. Image Anal. 14(3), 291–302 (2010). https://doi.org/10.1016/j.media.2010.01.006
Arezoomand, S., Lee, W., Rakhra, K., Beaulé, P.: A 3D active model framework for segmentation of proximal femur in MR images. Int. J. Comput. Assist. Radiol. Surg. 10(1), 55–66 (2015). https://doi.org/10.1007/s11548-014-1125-6
Chandra, S., Xia, Y., Engstrom, C., Crozier, S., Schwarz, R., Fripp, J.: Focused shape models for hip joint segmentation in 3D magnetic resonance images. Med. Image Anal. 18(3), 567–578 (2014). https://doi.org/10.1016/j.media.2014.02.002
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., et al. (eds.) Proceedings of Neural Information Processing Systems – NIPS 2012, vol. 25, pp. 1097–1105. NIPS (2012)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2015, pp. 3431–3440. IEEE (2015). https://doi.org/10.1109/CVPR.2015.7298965
Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., Nielsen, M.: Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 246–253. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40763-5_31
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of 4th International Conference on 3D Vision – 3DV 2016, pp. 565–571. IEEE (2016). https://doi.org/10.1109/3DV.2016.79
Dou, Q., et al.: 3D deeply supervised network for automated segmentation of volumetric medical images. Med. Image Anal. 41, 40–54 (2017). https://doi.org/10.1016/j.media.2017.05.001
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of 32nd International Conference on Machine Learning – ICML 2015, vol. 37, pp. 448–456. PLMR (2015)
Dahl, G., Sainath, T., Hinton, G.: Improving deep neural networks for LVCSR using rectified linear units and dropout. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing – ICASSP 2013, pp. 8609–8613. IEEE (2013). https://doi.org/10.1109/ICASSP.2013.6639346
Karasawa, K., et al.: Multi-atlas pancreas segmentation: atlas selection based on vessel structure. Med. Image Anal. 39, 18–28 (2017). https://doi.org/10.1016/j.media.2017.03.006
Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 850–863 (1993). https://doi.org/10.1109/34.232073
Yeghiazaryan, V., Voiculescu, I.: An overview of current evaluation methods used in medical image segmentation. Technical report CS-RR-15-08, University of Oxford, Department of Computer Science, UK (2015)
Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of 23rd International Conference on Machine Learning – ICML 2006, pp. 233–240. ACM (2006). https://doi.org/10.1145/1143844.1143874
Ravishankar, H., Venkataramani, R., Thiruvenkadam, S., Sudhakar, P., Vaidya, V.: Learning and incorporating shape models for semantic segmentation. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 203–211. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_24
Nosrati, M., Hamarneh, G.: Incorporating prior knowledge in medical image segmentation: a survey. arXiv:1607.01092 (2016). https://arxiv.org/abs/1607.01092
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zeng, G., Zheng, G. (2019). Deep Volumetric Shape Learning for Semantic Segmentation of the Hip Joint from 3D MR Images. In: Vrtovec, T., Yao, J., Zheng, G., Pozo, J. (eds) Computational Methods and Clinical Applications in Musculoskeletal Imaging. MSKI 2018. Lecture Notes in Computer Science(), vol 11404. Springer, Cham. https://doi.org/10.1007/978-3-030-11166-3_4
Download citation
DOI: https://doi.org/10.1007/978-3-030-11166-3_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11165-6
Online ISBN: 978-3-030-11166-3
eBook Packages: Computer ScienceComputer Science (R0)