1 Introduction

Femoroacetabular impingement (FAI) is a cause of hip pain in adults and has recently been recognized as a key risk factor for early cartilage and labral damage [1] and a possible precursor of hip osteoarthritis [2]. Several studies [2, 3] have shown a high prevalence of FAI in young populations with hip complaints. Although a number of imaging modalities can be used to diagnose and assess FAI, magnetic resonance (MR) imaging involves no ionizing radiation and is regarded as the standard tool for FAI diagnosis [4]. While manual analysis of a series of two-dimensional (2D) MR images is feasible, automated segmentation of the proximal femur in MR images would greatly facilitate the use of MR images for FAI surgical planning and simulation.

Automated MR image segmentation of the hip joint has been addressed by a few studies relying on atlas-based segmentation [5], graph cuts [6], active models [7, 8] or statistical shape models [9]. While these methods reported encouraging results for bone segmentation, further improvements are needed. For example, Arezoomand et al. [8] recently developed a three-dimensional (3D) active model framework for segmentation of the proximal femur in MR images and reported an average recall of 0.88.

Recently, machine-learning based methods, especially those based on convolutional neural networks (CNNs), have been applied successfully to natural image processing [10, 11] as well as to medical image analysis [12,13,14,15]. For example, Prasoon et al. [12] developed a triplanar CNN that learns features directly from images for knee cartilage segmentation. More recently, 3D volume-to-volume segmentation networks were introduced, including 3D U-Net [13], 3D V-Net [14] and a 3D deeply supervised network [15].

In this paper, we propose a fully convolutional volumetric auto-encoder that learns a volumetric latent representation from manual segmentations in order to regularize the results obtained from a fully convolutional network (FCN). Deriving the latent representation from such an auto-encoder, and then using it to regularize segmentation, are both memory intensive, since each step requires volumes containing the complete structures as input. To address this difficulty, we conduct the latent-space-constrained segmentation in a downsampled space and then introduce a super resolution network to restore the segmentation resolution, leading to improved segmentation accuracy. The details of our method are presented in Sect. 2. We then describe the experiments and results in Sect. 3, followed by a discussion in Sect. 4.

2 Method

Figure 1 illustrates the proposed method for hip joint segmentation. It contains three networks: an FCN-based segmentation network, a denoising autoencoder and a super resolution network. Our method takes a down-sampled, low resolution (LR) MR image as input and directly generates a high resolution (HR) segmentation. Below we first give an overview of the proposed method, followed by the loss functions, network architectures and implementation details.

2.1 Overview

The FCN-based segmentation network segments the input low resolution MR image \(x_l\) into three classes: acetabulum, femur and background. The denoising autoencoder learns a latent representation from manual volumetric segmentations in order to regularize the output of the FCN-based segmentation network.

The low resolution MR image and the segmentation from the denoising autoencoder provide complementary information: the former provides the original intensity information, while the latter provides anatomically constrained segmentation probabilities. The super resolution network takes both as input to generate the final segmentation in high resolution.

2.2 Loss Functions

We train our networks in three stages. In the first stage, the segmentation network (Snet) and the autoencoder (AE) are trained separately. In the second stage, the segmentation network is trained jointly with the autoencoder. In the third stage, the super resolution network is trained to upscale the output of the decoder to a high resolution segmentation.

Before introducing the loss functions, we define some notation. The input MR images in high resolution (HR) and low resolution (LR) are denoted \(x_h\) and \(x_l\), respectively, and the corresponding segmentation ground truths \(y_h\) and \(y_l\). The functions \(f(\cdot )\) and \(g(\cdot )\) denote the encoder and decoder components of the AE, and \(\phi (\cdot )\) and \(\varPhi (\cdot )\) denote the segmentation network and the super resolution network, respectively.

Fig. 1. A schematic view of the proposed method for hip joint segmentation.

Loss Function of Snet and AE at Stage One. Snet segments the low resolution input MR image \(x_l\) into three classes: acetabulum, femur and background. The AE learns a latent representation from which the segmentation input can be reconstructed. Specifically, the training objective of the AE is:

$$\begin{aligned} \min _{\theta _e} ~ \{ L_{mce} ( g(f(y_l)), y_l) + \lambda _w || w_e ||^2_2 ~\}, \end{aligned}$$
(1)

where \(\theta _e\) denotes all trainable parameters of the AE model, \(w_e\) denotes all trainable parameters of the AE network except the bias terms, \(\lambda _w\) weights the decay term, and \(L_{mce}\) is a multi-class cross entropy loss (mce) defined as:

$$\begin{aligned} { L_{mce}= - \frac{1}{N} \sum _{i=1}^{N}\sum _{c=1}^{C} y_{ic}\log \hat{y_{ic}}}, \end{aligned}$$
(2)

where N is the number of samples, C is the number of classes, \(\hat{y_{ic}}\) is the predicted probability and \(y_{ic}\) is the corresponding ground truth label.
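The loss in Eq. (2) is standard; a minimal NumPy sketch, with illustrative names not taken from the authors' implementation, is:

```python
import numpy as np

def multiclass_cross_entropy(y_true, y_pred, eps=1e-8):
    """Eq. (2): y_true is one-hot ground truth of shape (N, C),
    y_pred holds predicted probabilities of the same shape."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```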

The segmentation network trained in the first stage minimizes a voxel-wise multi-class cross entropy loss \(L_{l}\), defined as:

$$\begin{aligned} L_{l}(x_l, y_l ; \theta _s) = L_{mce} (\phi (x_l), y_l) + \lambda _w || w_s ||^2_2, \end{aligned}$$
(3)

where \(\theta _s\) denotes all trainable parameters of Snet and \(w_s\) denotes all trainable parameters of Snet except the bias terms.
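As a hedged illustration of the stage-one objectives in Eqs. (1) and (3), the following TensorFlow 2 sketch adds an L2 decay over non-bias weights to the cross-entropy term; `ae`, `snet` and the helper names are placeholders, and the actual implementation may differ.

```python
import tensorflow as tf

lambda_w = 0.005  # weight-decay coefficient; value taken from Sect. 2.4
cce = tf.keras.losses.CategoricalCrossentropy()

def weight_decay(model):
    # squared L2 norm over all trainable weights except bias terms
    return tf.add_n([tf.reduce_sum(tf.square(w))
                     for w in model.trainable_weights if "bias" not in w.name])

def ae_loss(ae, y_l):                                        # Eq. (1)
    return cce(y_l, ae(y_l)) + lambda_w * weight_decay(ae)

def snet_loss(snet, x_l, y_l):                               # Eq. (3)
    return cce(y_l, snet(x_l)) + lambda_w * weight_decay(snet)
```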

Loss Function of Snet at Stage Two. In stage two, the AE network is integrated into the segmentation network. The segmentation network is trained to minimize not only the voxel-wise cross-entropy loss, but also the Euclidean loss \(L_e\) between the latent representations learnt by the AE, defined as:

$$\begin{aligned} L_e(x_l, y_l ) = || f(\phi (x_l)) - f(y_l) ||^2_2. \end{aligned}$$
(4)

The training objective of the segmentation network in the second stage is:

$$\begin{aligned} \min _{\theta _s} ~ \{ L_{mce} (\phi (x_l), y_l) + \lambda _1 L_e + \lambda _w || w_s ||^2_2 ~\}. \end{aligned}$$
(5)
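Continuing the previous sketch, Eqs. (4) and (5) can be written as follows, where `encoder` plays the role of \(f(\cdot)\) and `lambda_1` is a placeholder for \(\lambda_1\):

```python
def stage_two_loss(snet, encoder, x_l, y_l, lambda_1=1.0):
    y_pred = snet(x_l)
    # Eq. (4): squared Euclidean distance between the latent codes of the
    # predicted segmentation and of the ground truth
    l_e = tf.reduce_sum(tf.square(encoder(y_pred) - encoder(y_l)))
    # Eq. (5): cross entropy + latent-space term + weight decay
    return cce(y_l, y_pred) + lambda_1 * l_e + lambda_w * weight_decay(snet)
```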

Loss Function of the Super Resolution Network at Stage Three. In the third stage, the super resolution network is trained to upscale the low resolution segmentation, i.e. the output of the decoder in the AE, to a high resolution segmentation. Its loss is defined as:

$$\begin{aligned} L_{h}(x_l, y_h; \theta _r) = L_{mce} (\varPhi (g(f(\phi (x_l)))), y_h) + \lambda _w || w_r ||^2_2, \end{aligned}$$
(6)

where \(\theta _r\) denotes all trainable parameters of the super resolution network and \(w_r\) denotes all trainable parameters of the super resolution network except the bias terms.
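The composition \(\varPhi (g(f(\phi (x_l))))\) in Eq. (6) can be sketched in the same style; following Sect. 2.1, the super resolution network additionally receives the LR image, which Eq. (6) leaves implicit:

```python
def stage_three_loss(srnet, snet, encoder, decoder, x_l, y_h):
    probs = decoder(encoder(snet(x_l)))              # g(f(phi(x_l))), LR probabilities
    y_hr = srnet(tf.concat([x_l, probs], axis=-1))   # Phi(.), HR prediction
    return cce(y_h, y_hr) + lambda_w * weight_decay(srnet)   # Eq. (6)
```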

2.3 Network Architectures

Below we present the network architectures of different sub-networks involved in our method.

Segmentation Network. Figure 2 illustrates the architecture of the segmentation network, a variant of the 3D U-Net [13].

Fig. 2. The architecture of the segmentation network.
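Since the text only states that the network is a 3D U-Net variant, the following tf.keras sketch shows a generic shallow 3D U-Net; depth, channel counts and input ordering are hypothetical and do not reproduce Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv3D(filters, 3, padding="same")(x)
    return layers.ReLU()(layers.BatchNormalization()(x))

def build_snet(input_shape=(40, 120, 120, 1), n_classes=3):
    inp = tf.keras.Input(input_shape)            # LR volume, depth-first layout
    e1 = conv_block(inp, 16)                     # encoder path
    e2 = conv_block(layers.MaxPool3D(2)(e1), 32)
    b  = conv_block(layers.MaxPool3D(2)(e2), 64) # bottleneck
    d2 = conv_block(layers.concatenate(          # decoder path with skips
             [layers.Conv3DTranspose(32, 2, strides=2)(b), e2]), 32)
    d1 = conv_block(layers.concatenate(
             [layers.Conv3DTranspose(16, 2, strides=2)(d2), e1]), 16)
    out = layers.Conv3D(n_classes, 1, activation="softmax")(d1)
    return tf.keras.Model(inp, out)
```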

Denoising Autoencoder. Figure 3 shows the architecture of the fully convolutional denoising autoencoder, which learns an end-to-end, voxel-to-voxel mapping. The left half of the network is an encoder stage that produces a condensed representation (the "latent vector representation"). In the second stage (right half), the network reconstructs the input from the latent vector representation through deconvolutional (3DDeconv) layers. The network is trained with a cross-entropy loss. After training, the encoder \(f(y;{\theta _f})\) can be used to map a noisy volumetric label to a vector representation h in the latent space.

Fig. 3. The architecture of the fully convolutional denoising autoencoder for volumetric representation.
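A hedged sketch of such a fully convolutional autoencoder is given below; strided convolutions implement the encoder \(f(\cdot)\) and 3DDeconv layers the decoder \(g(\cdot)\). Filter counts and the latent size are illustrative, not taken from Fig. 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(input_shape=(40, 120, 120, 3), n_classes=3):
    y = tf.keras.Input(input_shape)              # (noisy) label volume
    h = y
    for filters in (16, 32, 64):                 # encoder f(.)
        h = layers.Conv3D(filters, 3, strides=2, padding="same")(h)
        h = layers.ReLU()(layers.BatchNormalization()(h))
    latent = h                                   # "latent vector representation"
    for filters in (64, 32, 16):                 # decoder g(.) with 3DDeconv
        h = layers.Conv3DTranspose(filters, 3, strides=2, padding="same")(h)
        h = layers.ReLU()(layers.BatchNormalization()(h))
    out = layers.Conv3D(n_classes, 1, activation="softmax")(h)
    return tf.keras.Model(y, latent), tf.keras.Model(y, out)  # f(.), full AE
```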

Super Resolution Network. Figure 4 illustrates the architecture of the super resolution network. Its input is the concatenation of the low resolution MR image and the segmentation probability map reconstructed by the denoising autoencoder. After one convolutional layer applied to the input, the feature maps are upscaled by a factor of \(2\times 2\times 2\) through a deconvolutional layer. Two further convolutional layers then generate the segmentation in high resolution.

All convolutional and deconvolutional layers are followed by a BN layer [16] and a ReLU layer [17], except the last one before the final output layer. All convolutional and deconvolutional layers use filters of size \(3\times 3\times 3\).

Fig. 4. The architecture of the super resolution network.
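A sketch that follows this description literally (one convolution, a \(\times 2\) deconvolution, two more convolutions, \(3\times 3\times 3\) filters throughout, BN + ReLU after all but the last layer) might look as follows; the filter counts are assumed.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bn_relu(x):
    return layers.ReLU()(layers.BatchNormalization()(x))

def build_srnet(input_shape=(40, 120, 120, 4)):  # LR image + 3 class maps
    inp = tf.keras.Input(input_shape)
    h = bn_relu(layers.Conv3D(32, 3, padding="same")(inp))
    h = bn_relu(layers.Conv3DTranspose(32, 3, strides=2, padding="same")(h))  # x2 upscale
    h = bn_relu(layers.Conv3D(32, 3, padding="same")(h))
    out = layers.Conv3D(3, 3, padding="same", activation="softmax")(h)  # HR segmentation
    return tf.keras.Model(inp, out)
```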

2.4 Implementation Details

The proposed method was implemented in Python using the TensorFlow framework on a desktop with a 3.6 GHz Intel(R) i7 CPU and a GTX 1080 Ti graphics card with 11 GB of GPU memory.

Training. The whole low resolution MR volume, of size \(120\times 120\times 40\), is fed into the network. Each training volume was normalized to zero mean and unit variance. As we train our three networks (segmentation network, autoencoder and super resolution network) end-to-end, a batch size of 1 was adopted. We trained our networks from scratch, with all weights initialized from a Gaussian distribution (\(\mu = 0,\sigma = 0.01\)). As mentioned above, training was done in three stages. The segmentation network and the autoencoder were first trained separately for 10,000 iterations, with all weights updated by the stochastic gradient descent (SGD) algorithm (momentum = 0.9, weight decay = 0.005). We then started the second stage, in which the segmentation network and the autoencoder were jointly trained for another 10,000 iterations. In the final stage, the super resolution network was trained for another 10,000 iterations, for 30,000 iterations in total. For each stage, the initial learning rate was \(1\times 10^{-3}\) and was halved every 3,000 training iterations.
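As a hedged sketch of this schedule (SGD with momentum 0.9, initial learning rate \(10^{-3}\), halved every 3,000 iterations), one could configure the optimizer as follows; note that the weight decay enters through the \(\lambda_w\) terms of the losses rather than the optimizer:

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=3000, decay_rate=0.5, staircase=True)  # halve every 3,000 iters
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```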

Testing. In the inference phase, the high resolution MR image is downscaled to a low resolution image of size \(120\times 120\times 40\), from which the trained model directly generates the high resolution segmentation.

3 Experiments and Results

We conduct a series of experiments. First we evaluate the reconstruction performance of the trained denoising autoencoder model; then we compare different methods on hip joint segmentation. Metrics used to evaluate performance include the Dice overlap coefficient (Dice) [18], Hausdorff distance (HD) [19], average surface distance (ASD) [20], and precision and recall [21].
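For reference, a minimal NumPy sketch of the Dice coefficient for a single class is given below; HD and ASD are surface-based distances and are omitted here.

```python
import numpy as np

def dice(pred, gt):
    """pred, gt: boolean masks of one class over the volume."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)
```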

3.1 Dataset and Preprocessing

The dataset contains hip joint MR data of 24 FAI patients. All data are volumetric MR images of size \(240\times 240\times 80\) (high resolution). To feed whole volumes into the neural network, all MR images were downscaled to \(120\times 120\times 40\) (low resolution). Further, to enlarge the training set and mitigate possible over-fitting, random noise was injected: a random value in \((-3, 3)\) was added to each voxel. Finally, each training sample was normalized to zero mean and unit variance before being fed into the network.
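A sketch of this preprocessing, assuming scipy's `zoom` as the resampler (the actual resampling method is not stated), is:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume_hr, rng=np.random.default_rng()):
    x = zoom(volume_hr, 0.5, order=1)              # 240x240x80 -> 120x120x40
    x = x + rng.uniform(-3.0, 3.0, size=x.shape)   # noise injection (augmentation)
    return (x - x.mean()) / (x.std() + 1e-8)       # zero mean, unit variance
```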

3.2 Denoising Autoencoder Experiments

In this experiment, we evaluate the performance of the denoising autoencoder in terms of shape completion and false positive exclusion. Standard two-fold cross validation was adopted: the dataset was randomly split into two separate parts, with 12 cases used for training and 12 for testing.

Qualitative results from the denoising autoencoder are shown in Fig. 5. The top half of the figure illustrates the shape completion function: \(50\%\) of the input labels are randomly dropped, producing incomplete shapes with many "holes", yet the denoising autoencoder generates hole-free labels with complete shape. The bottom half shows the false positive exclusion function: \(50\%\) random noise is added to the input labels, producing many false positive predictions, yet the denoising autoencoder again generates labels with complete shape and almost no false positives.

Fig. 5. Shape reconstruction with dropout and random noise.
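The exact corruption recipe is not fully specified; one plausible NumPy reading, dropping foreground labels to create holes and flipping background voxels to random foreground classes to create false positives, is:

```python
import numpy as np

def corrupt_labels(y, drop_rate=0.5, noise_rate=0.5, n_classes=3,
                   rng=np.random.default_rng()):
    """y: integer label volume (0 = background, 1 = acetabulum, 2 = femur)."""
    y = y.copy()
    fg = y > 0
    y[fg & (rng.random(y.shape) < drop_rate)] = 0             # holes
    flip = (~fg) & (rng.random(y.shape) < noise_rate * fg.mean())
    y[flip] = rng.integers(1, n_classes, size=y.shape)[flip]  # false positives
    return y
```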

We further explore how much shape corruption and noise the denoising autoencoder can tolerate. Specifically, we seek the minimum percentage of the original labels that the model needs, and the maximum percentage of noise it can tolerate, while still producing accurate reconstructions. Table 1 shows the quantitative relationship between the dropout rate of the input labels and the reconstruction accuracy. Even with a dropout rate of \(95\%\), i.e. when only \(5\%\) of the original labels are kept, the model still produces good reconstructions. When the dropout rate is increased to \(99\%\), however, the ASD of both acetabulum and femur exceeds 1 mm, which we regard as a poor reconstruction.

Table 1. Dropout rate and reconstruction accuracy (Dice: Dice overlap coefficient; ASD: average surface distance).

Fixing the dropout rate at its maximum (\(95\%\)), we explore the maximum noise level the model can tolerate. Table 2 shows the quantitative relationship between the noise level and the reconstruction accuracy. Even with \(75\%\) random noise, the denoising autoencoder still produces good reconstructions. When the noise level is increased to \(100\%\), the average ASD of the acetabulum increases to 1.07 mm and the reconstruction is corrupted.

Table 2. Noise level and reconstruction accuracy (Dice: Dice overlap coefficient; ASD: average surface distance).

Based on the above analysis, the trained denoising autoencoder can produce good reconstructions from only \(5\%\) of the original input labels and tolerates up to \(75\%\) noise. Figure 6 shows an example of reconstruction with both corrupted labels and noise.

Fig. 6. Shape reconstruction with dropout and random noise.

These results demonstrate that the model can reconstruct hip shapes from very little input information and with high noise tolerance, which raises the question of whether it simply represents a fixed shape. Figure 7 shows four reconstructions randomly selected from the 12 test cases. Even though the shapes of the acetabulum vary considerably, the model fits each case accurately, demonstrating that it captures the full range of acetabular shapes represented in the input data.

Fig. 7. Randomly selected shape reconstruction examples.

3.3 Segmentation Experiments

In this experiment, the proposed method is applied to segment the hip joint from 3D MR images using the dataset described in Sect. 3.1. We conduct a standard two-fold cross validation to evaluate the performance.

The proposed method is compared with: a patch-based 3D U-Net [13] model in high resolution (HR-3DUnet), a whole-volume 3D U-Net model in low resolution (LR-3DUnet) and a cascaded 3D U-Net with AE regularization (AE-Seg) [22]. The experimental results are shown in Table 3.

Table 3. Quantitative comparison of segmentation results (Dice: Dice overlap coefficient; HD: Hausdorff distance; ASD: average surface distance; Prec.: precision).

Fig. 8. Comparison of segmentations from different methods.

From the results in Table 3, we observe that HR-3DUnet significantly under-performs the other methods in terms of Dice (0.842 for acetabulum and 0.873 for femur), HD (65.91 mm for acetabulum and 40.67 mm for femur) and ASD (7.80 mm for acetabulum and 6.49 mm for femur). The poor performance is also visible in the qualitative comparison in Fig. 8: the segmentation results from HR-3DUnet contain many more false positive detections than LR-3DUnet, especially in the axial and sagittal views. This is because HR-3DUnet adopts a patch-based strategy, and a patch cannot exploit global context information. Global context is critically important for hip joint segmentation, because the gray values of the acetabulum, femur and background are very similar and the boundaries between them are fuzzy.

LR-3DUnet takes the whole low resolution volume as input, allowing the segmentation to make more use of global context, and all evaluation metrics in Table 3 are better for LR-3DUnet than for HR-3DUnet. However, LR-3DUnet still relies on a voxel-wise loss, which cannot guarantee global consistency or meaningful structure. Its mean ASD for the acetabulum and femur is 1.79 mm and 1.60 mm, respectively, which is large compared to a good segmentation. In addition, there are many false positive predictions, especially in the coronal view, as seen in the third row of Fig. 8.

AE-Seg [22] constrains the segmentation shape implicitly, implementing the constraint as a regularization term in the training phase. However, this constraint is too weak to force the network to generate segmentations that lie completely in the latent space. In the fourth row of Fig. 8, AE-Seg partially removes some false positives in the sagittal view, but the false positive predictions in the axial and coronal views are not eliminated. Moreover, the shape of the acetabulum is complex and varies considerably across subjects; because the regularization term on the latent representation works only implicitly, it cannot truly guarantee segmentations with meaningful shape. This is also why, although AE-Seg improves the Dice for the acetabulum compared with LR-3DUnet, its average HD (29.23 mm) is even worse than that of LR-3DUnet (24.76 mm).

Our method explicitly constrains the segmentation shape via the denoising autoencoder. The outputs of the denoising autoencoder are shown in the fifth row of Fig. 8: it automatically removes false positives and fills in missing parts using prior knowledge.

The comparison between the segmentation results of our method and the ground truth is shown in the bottom two rows of Fig. 8, and clearly demonstrates the effectiveness of incorporating the denoising autoencoder and the super resolution network to generate the final segmentation. The quantitative comparison in Table 3 also demonstrates the efficacy of the proposed method: compared with HR-3DUnet, our method improves the Dice by \(5.7\%\) and \(6.0\%\) for segmenting the acetabulum and the proximal femur, respectively.

4 Discussion

Incorporating prior knowledge into image segmentation algorithms has proven useful for obtaining more accurate and plausible results [23]. In this paper, we proposed a fully convolutional volumetric autoencoder that learns a volumetric latent representation in order to regularize the output of a segmentation network, and further introduced a super resolution network to improve segmentation accuracy. We conducted comprehensive experiments on 3D MR images of 24 patients to validate the efficacy of the proposed method.