1 Introduction

Accurate and robust segmentation of anatomical structures from magnetic resonance (MR) images is valuable in many computer-aided clinical tasks. However, it is often challenging due to ambiguous structure boundaries, resemblance among structures, insufficient image contrast and/or spatial resolution, intensity inhomogeneity, and image dimensionality, especially in the case of three-dimensional (3D) images. Segmentation of vertebral bodies (VBs) from 3D MR images has been the topic of several studies, as it represents a non-invasive approach to precise quantitative evaluation of many spinal disorders [1], including spondylolisthesis, disc herniation, stenosis and osteoporosis. Štern et al. [2] introduced superquadric models that were aligned to VBs in 3D MR images and yielded a distance error of \(1.85\,\text {mm}\), Neubert et al. [3] applied active shape models and achieved a distance error of \(0.67\,\text {mm}\) and a Dice overlap of \(91\,\%\), while Zukić et al. [4] used VB boundary models for MR image feature classification and model mesh inflation for VB segmentation, yielding a distance error of \(1.76\,\text {mm}\) and a Dice overlap of \(79.3\,\%\). Schwarzenberg et al. [5] described a graph-cut segmentation based on the deformation of a cubic template, achieving a Dice overlap of \(81.3\,\%\). Recently, Chu et al. [6] presented a learning-based method for localization and segmentation of VBs from computed tomography and MR images, where the likelihood of each voxel belonging to a VB or to the background was estimated by random forest classifiers, and reported a distance error of \(1.5\,\text {mm}\) and a Dice overlap of \(88.7\,\%\) when segmenting 3D MR images. The authors also publicly released a database of 3D MR spine images with annotated VBs, thereby enabling objective quantitative comparison of VB segmentation algorithms.

In this paper, we propose an automated method for supervised segmentation of VBs from 3D MR spine images. For this purpose, a 3D convolutional neural network (CNN) architecture was designed that learns the appearance of VBs and background from a training set of MR images, and generates 3D spatial VB probability maps that guide deformable models towards VB boundaries to obtain accurate VB segmentations in a previously unseen MR image.

Fig. 1. (a) The pipeline of the proposed methodology. (b) Spatial vertebral body (VB) probability maps and corresponding segmentations for three selected VBs.

2 Methodology

Let \(\mathcal {M}\) represent the 3D mesh of the mean shape model of VBs in the training set. By coarsely initializing \(\mathcal {M}\) in the 3D image and increasing the size of its minimum bounding box, the volume of interest (VOI) that contains the observed VB is determined. A 3D CNN classifier [7, 8] is first applied to obtain the 3D spatial VB probability map, which represents likelihoods of VOI voxels to belong to the observed VB. The deformable modeling process is then guided by the obtained map and constrained by VB shapes in the training set to obtain a smooth and accurate segmentation of the observed VB (Fig. 1(a)).

2.1 Generation of 3D Spatial VB Probability Maps Using CNN

Designed as a special kind of artificial neural network classifier, a CNN exploits the spatial structure of the input image, providing a degree of generalization as well as robustness to simple geometric transformations [7]. Starting with an input in the form of a 3D image patch \({\varvec{P}}\) around the observed VOI voxel, the CNN is formed by alternating convolutional and pooling layers (\(l = 1, 2, \ldots , L'\)) that are attached to fully connected layers (\(l = L'+1, L'+2, \ldots , L\)).

Convolutional and Pooling Layers. First, several convolutions are performed in parallel to produce the m-th ordinary (feature) map \({\varvec{X}}_{l}^{m}\) of the l-th convolutional layer, with the value at position (i, j, k):

$$\begin{aligned} {\varvec{X}}_{l}^{m}(i,j,k) = b_{l}^{m} + \sum _{n = 1}^{M_{l-1}} \sum _{p = 1}^{P_l} \sum _{q = 1}^{Q_l} \sum _{r = 1}^{R_l} {\varvec{W}}_{l}^{m,n}(p,q,r) \, {\varvec{Z}}_{l-1}^{n}(i+p-1,j+q-1,k+r-1), \end{aligned}$$
(1)

where \(b_{l}^{m}\) denotes the additive bias, and \({\varvec{W}}_{l}^{m,n}\) the convolution kernel of height \(P_l\), width \(Q_l\) and depth \(R_l\), connected to the n-th (out of \(M_{l-1}\)) pooled map \({\varvec{Z}}_{l-1}^{n}\) in the \((l-1)\)-th pooling layer. The activated map \({\varvec{Y}}_{l}^{m} = f_l({\varvec{X}}_{l}^{m})\) is obtained by running \({\varvec{X}}_{l}^{m}\) through an activation function \(f_l\), and is then transformed into the m-th pooled map \({\varvec{Z}}_{l}^{m}\) of the l-th pooling layer, with the value at position (i, j, k):

$$\begin{aligned} {\varvec{Z}}_{l}^{m}(i,j,k) = g_l \left( {\varvec{Y}}_{l}^{m}(p,q,r) \;\Big|\; {\begin{array}{c} p \in \left\{ \widetilde{P}_l(i-1)+1, \ldots , \widetilde{P}_l i\right\} \\ q \in \left\{ \widetilde{Q}_l(j-1)+1, \ldots , \widetilde{Q}_l j\right\} \\ r \in \left\{ \widetilde{R}_l(k-1)+1, \ldots , \widetilde{R}_l k\right\} \end{array} } \right) , \end{aligned}$$
(2)

where \(g_l\) is a pooling function on a kernel of height \(\widetilde{P}_l\), width \(\widetilde{Q}_l\) and depth \(\widetilde{R}_l\).
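To make the indexing in Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of a single "valid" 3D convolution and a non-overlapping max-pooling step; it is an illustration of the formulas, not the authors' implementation, and the array shapes are assumptions:

```python
import numpy as np

def conv3d_valid(Z_prev, W, b):
    """Eq. (1): one ordinary (feature) map from a 'valid' 3D convolution.

    Z_prev : (M_prev, D1, D2, D3) pooled maps of layer l-1
    W      : (M_prev, P, Q, R)    kernels connecting them to the m-th map
    b      : scalar               additive bias b_l^m
    """
    _, P, Q, R = W.shape
    _, D1, D2, D3 = Z_prev.shape
    X = np.full((D1 - P + 1, D2 - Q + 1, D3 - R + 1), b, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            for k in range(X.shape[2]):
                # sum over input maps n and kernel offsets p, q, r
                X[i, j, k] += np.sum(W * Z_prev[:, i:i+P, j:j+Q, k:k+R])
    return X

def max_pool3d(Y, p, q, r):
    """Eq. (2) with g_l = max: each p x q x r block is replaced by its maximum."""
    D1, D2, D3 = Y.shape
    Y = Y[:D1 - D1 % p, :D2 - D2 % q, :D3 - D3 % r]   # crop to whole blocks
    return Y.reshape(D1 // p, p, D2 // q, q, D3 // r, r).max(axis=(1, 3, 5))
```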

Fully Connected Layers. Convolutional and pooling layers are attached to fully connected layers by representing the \(M_{L'}\) pooled maps \({\varvec{Z}}_{L'}^{m}\) in the \(L'\)-th pooling layer with vector \({\varvec{w}}_{L'}\) that is sent to the first fully connected layer. Vector \({\varvec{w}}_{l}\) of size \(S_{l}\) is the map of the l-th fully connected layer, with the value at position i:

$$\begin{aligned} {\varvec{w}}_{l}(i) = h_l \left( {\varvec{b}}_{l}(i) + \sum _{j=1}^{S_{l-1}}{\varvec{W}}_{\!l}(i,j) \cdot {\varvec{w}}_{l-1}(j) \right) , \end{aligned}$$
(3)

where \(h_l\) denotes an activation function, \({\varvec{b}}_{l}\) the additive bias vector, and \({\varvec{W}}_{\!l}\) the weight matrix of size \(S_l \times S_{l-1}\). The output \({\varvec{w}}_{L}\,{=}\,(w_{L,1},w_{L,2})^T\) of the L-th fully connected layer is passed through the softmax function to obtain vector \({\varvec{p}}\):

$$\begin{aligned} {\varvec{p}} = (p_1, p_2)^T = \frac{(e^{w_{L,1}},e^{w_{L,2}})^T}{e^{w_{L,1}} + e^{w_{L,2}}}; \quad \quad p_1+p_2=1; \quad \quad p_1,p_2 \in [0,1], \end{aligned}$$
(4)

composed of probabilities \(p_1\) and \(p_2\) that the observed VOI voxel belongs to the object of interest (i.e. VB) or background, respectively, according to the input \({\varvec{P}}\), weights \({\varvec{W\!}}=\{{\varvec{W}}_{\!1},\ldots ,{\varvec{W}}_{\!L}\}\) and biases \({\varvec{b}} = \{{\varvec{b}}_1,\ldots ,{\varvec{b}}_L\}\) (Fig. 1(b)).
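As a sketch, Eqs. (3) and (4) amount to a matrix-vector product with an activation, followed by a two-class softmax; the layer sizes below are only illustrative:

```python
import numpy as np

def fully_connected(w_prev, W, b, h):
    """Eq. (3): w_l(i) = h(b(i) + sum_j W(i, j) * w_{l-1}(j))."""
    return h(b + W @ w_prev)

def softmax(w_L):
    """Eq. (4); subtracting the maximum is a standard numerical-stability trick."""
    e = np.exp(w_L - np.max(w_L))
    return e / e.sum()

rng = np.random.default_rng(0)
w = rng.standard_normal(384)                        # vectorized pooled maps w_L'
w = fully_connected(w, rng.standard_normal((192, 384)),
                    np.zeros(192), h=lambda x: np.maximum(x, 0.0))
p = softmax(fully_connected(w, rng.standard_normal((2, 192)),
                            np.zeros(2), h=lambda x: x))
print(p, p.sum())                                   # p1 + p2 = 1
```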

2.2 Segmentation of VBs with Deformable Models Using 3D Spatial VB Probability Maps

The applied segmentation of VBs [9–11] is based on deforming mesh \(\mathcal {M}=\{\mathcal {V},\mathcal {F}\}\) (composed of vertices \(\mathcal {V}\) and faces \(\mathcal {F}\)) in the 3D MR image by iteratively minimizing the external energy, which drives \(\mathcal {M}\) towards VB boundaries, and the internal energy, which constrains \(\mathcal {M}\) according to the training set of VBs.

External Energy. In the k-th (out of K) iteration of the applied segmentation, each vertex \({\varvec{x}}_i\) of mesh \(\mathcal {M}\) (\(i\,{=}\,1,\,2,\,{\ldots },\,\left| \mathcal {V}\right| \)) is displaced along the corresponding mesh normal \({\varvec{n}}_i\) to find a more promising vertex on the VB boundary:

$$\begin{aligned} {\varvec{x}}_i^{*} = {\varvec{x}}_i + \delta \, j_i^{*} \, {\varvec{n}}_i; \quad \quad j_i^{*} = \underset{j \in \mathcal {J}}{\arg \max } \Big \{ F_i({\varvec{x}}_i + \delta \, j \, {\varvec{n}}_i) - D\left| \delta j\right| \Big \}, \end{aligned}$$
(5)

where \(\delta \) is the sampling distance, \(j_i^{*}\) is an element of the iteratively reduced search profile \(\mathcal {J}\,{=}\,\{-j',\,-j'+1,\,{\ldots },\,j'-1,\,j'\}\) along \({\varvec{n}}_i\), with \(j'\,{=}\,{\lfloor }\frac{J-1}{1-K}{\cdot }k\,{+}\,\frac{1\,-\,J\,K}{1-K}{\rfloor }\) (\({\lfloor }.{\rfloor }\) is the floor function) shrinking linearly from J at \(k\,{=}\,1\) to 1 at \(k\,{=}\,K\), and D weights the trade-off between the distance from \({\varvec{x}}_i\) to \({\varvec{x}}_i\,{+}\,\delta \,j\,{\varvec{n}}_i\) (i.e. \(\left| \delta j\right| \)) and the boundary detection function \(F_i({\varvec{x}}) = \left\langle {\varvec{n}}_i,{\varvec{g}}({\varvec{x}})\right\rangle \) (\(\left\langle .,.\right\rangle \) is the dot product), with \({\varvec{g}}({\varvec{x}})\) being the gradient of the 3D spatial VB probability map at location \({\varvec{x}}\). The external energy \(E_{\text {ext}}\) attracts vertices \({\varvec{x}}_i\) to vertices \({\varvec{x}}_i^{*}\):

$$\begin{aligned} E_{\text {ext}} = \sum _{i=1}^{\left| \mathcal {V}\right| } w_i^* \left\| \mathop {\mathrm {proj}}\left( {\varvec{x}}_i^{*} - {\varvec{x}}_i,\frac{{\varvec{g}}({\varvec{x}}_i^{*})}{\left\| {\varvec{g}}({\varvec{x}}_i^{*})\right\| }\right) \right\| ^2, \end{aligned}$$
(6)

where the weight \(w_i^*= \max \{0, F_i({\varvec{x}}_i^{*}) - D\left| \delta j_i^{*}\right| \}\) (Eq. 5) gives greater influence to vertices \({\varvec{x}}_i^*\) with stronger boundary evidence. By projecting \({\varvec{x}}_i^{*} - {\varvec{x}}_i\) onto the gradient of the 3D spatial VB probability map \({\varvec{g}}({\varvec{x}}_i^{*})\) at location \({\varvec{x}}_i^{*}\) (\(\mathop {\mathrm {proj}}(.,.)\) is the vector projection), \(E_{\text {ext}}\) becomes invariant to movements of vertices within the plane perpendicular to \({\varvec{g}}({\varvec{x}}_i^{*})\).
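A minimal NumPy sketch of the boundary search in Eq. (5) and the projection used in Eq. (6) is given below; it assumes \(F_i\) is supplied as a callable that interpolates \(\left\langle {\varvec{n}}_i,{\varvec{g}}({\varvec{x}})\right\rangle \) from the probability-map gradient:

```python
import math
import numpy as np

def profile_halfwidth(k, J, K):
    """Half-width j' of the search profile; shrinks linearly from J (k=1) to 1 (k=K)."""
    return math.floor((J - 1) / (1 - K) * k + (1 - J * K) / (1 - K))

def search_boundary_vertex(x_i, n_i, F_i, delta, j_half, D):
    """Eq. (5): pick the displacement that best trades boundary evidence
    F_i against the travelled distance |delta * j|."""
    js = np.arange(-j_half, j_half + 1)
    scores = [F_i(x_i + delta * j * n_i) - D * abs(delta * j) for j in js]
    best = int(np.argmax(scores))
    x_star = x_i + delta * js[best] * n_i
    w_star = max(0.0, scores[best])        # weight w_i* used in Eq. (6)
    return x_star, w_star

def proj(v, g):
    """Vector projection of v onto the gradient direction g (Eq. (6))."""
    u = g / np.linalg.norm(g)
    return (v @ u) * u
```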

Internal Energy. From the training set of VBs, M aligned and topologically corresponding meshes \(\{\mathcal {M}_m\}_{m=1}^{M}\,{=}\,\{\mathcal {V}_m,\,\mathcal {F}_m\}_{m=1}^{M}\) (\(\left| \mathcal {V}_m\right| \,{=}\,\left| \mathcal {V}\right| \), \(\mathcal {F}_m\,{=}\,\mathcal {F}\)) are extracted, and their vertices \(\{\mathcal {V}_m\}_{m=1}^{M}\) are stacked into matrix \(S\,{=}\,[{\varvec{s}}_1,\,{\ldots },\,{\varvec{s}}_M]\) of size \(3\left| \mathcal {V}\right| {\times }M\), where vector \({\varvec{s}}_m\,{=}\,\varPhi (\mathcal {V}_m)\) contains the coordinate triplets of \(\mathcal {V}_m\), vectorized by function \(\varPhi \). In each k-th (out of K) iteration of the applied segmentation, \(\mathcal {M}\) is defined as the linear combination of \(\{\mathcal {M}_m\}_{m=1}^{M}\) that best approximates \(\varPhi (\mathcal {V})\) with S:

$$\begin{aligned} {\varvec{w}}^{*} = \underset{{\varvec{w}}{\in }\mathbb {R}^{M{\times }1}}{\arg \min }\left\| S{\varvec{w}} - T(\varPhi (\mathcal {V}))\right\| , \end{aligned}$$
(7)

where \({\varvec{w}}\) are linear combination coefficients, and T is the rigid transformation that aligns \(\varPhi (\mathcal {V})\) to S. Coefficients \({\varvec{w}}^*\) represent contributions of \(\{\mathcal {M}_m\}_{m=1}^{M}\) to the reconfiguration of vertices \(\mathcal {V}^{*}\,{=}\,\varPhi ^{-1}\left( T^{-1}(S{\varvec{w}}^{*})\right) \) that form a new topologically corresponding mesh \(\mathcal {M}^{*}\,{=}\,\{\mathcal {V}^{*},\,\mathcal {F}^{*}\}\) (\(\left| \mathcal {V}^{*}\right| \,{=}\,\left| \mathcal {V}\right| \), \(\mathcal {F}^{*}\,{=}\,\mathcal {F}\)). The internal energy \(E_{\text {int}}\) penalizes the deviation of vertices \({\varvec{y}}_i\,{\in }\,\mathcal {V}\) from vertices \({\varvec{y}}_i^{*}\,{\in }\,\mathcal {V}^{*}\):

$$\begin{aligned} E_{\text {int}} = \sum _{i=1}^{\left| \mathcal {V}\right| }\sum _{j\in \mathcal {N}_i}\left\| \left( {\varvec{y}}_i-{\varvec{y}}_j\right) -\left( {\varvec{y}}_i^{*}-{\varvec{y}}_j^{*}\right) \right\| ^2, \end{aligned}$$
(8)

where \(\mathcal {N}_i\) is the set of vertices in the neighborhood of \({\varvec{y}}_i\) (or \({\varvec{y}}_i^{*}\), as the topology is preserved). The deformation flexibility of mesh \(\mathcal {M}\) is therefore regulated by the shape prior constraint between vertices \(\mathcal {V}\) and \(\mathcal {V}^{*}\).
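Under the notation above, Eq. (7) is an ordinary least-squares problem and Eq. (8) a sum over mesh edges; a minimal NumPy sketch follows, where the rigid transformation T is assumed to be given as a callable:

```python
import numpy as np

def shape_coefficients(S, v_current, T):
    """Eq. (7): w* = argmin_w || S w - T(Phi(V)) || via least squares.

    S         : (3|V|, M) vectorized training shapes
    v_current : (3|V|,)   vectorized current vertices Phi(V)
    T         : callable  rigid alignment of Phi(V) to the shape space
    """
    w_star, *_ = np.linalg.lstsq(S, T(v_current), rcond=None)
    return w_star

def internal_energy(V, V_star, neighbors):
    """Eq. (8): deviation of current edge vectors from the shape-prior edges.

    V, V_star : (|V|, 3) current and reconfigured vertices
    neighbors : list of index lists, neighbors[i] = N_i
    """
    E = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            d = (V[i] - V[j]) - (V_star[i] - V_star[j])
            E += float(d @ d)
    return E
```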

Energy Minimization. Once vertices \({\varvec{x}}_i^{*}\) and \({\varvec{y}}_i^{*}\) are determined, mesh \(\mathcal {M}\) is reconfigured in each k-th iteration of the applied segmentation by minimizing \(E\,{=}\,E_{\text {ext}}\,{+}\,{\alpha }E_{\text {int}}\), where \(\alpha \) balances the contribution of the two energy terms. According to \(E_{\text {ext}}\), \(\mathcal {M}\) is driven along iteratively reduced displacements towards VB boundaries, identified as strong gradients in the 3D spatial VB probability map generated by the proposed 3D CNN, while \(E_{\text {int}}\) restricts the flexibility of \(\mathcal {M}\) by preserving the vertex distribution learned from the training set of VBs. After K iterations, mesh \(\mathcal {M}^*\) represents the final segmentation of the observed VB.
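Schematically, the per-iteration reconfiguration can be written as a single unconstrained minimization; the sketch below uses SciPy's conjugate gradient solver, matching the solver named in Sect. 3.2, and assumes \(E_{\text {ext}}\) and \(E_{\text {int}}\) have already been closed over the current targets \({\varvec{x}}_i^{*}\) and \({\varvec{y}}_i^{*}\):

```python
import numpy as np
from scipy.optimize import minimize

def reconfigure_mesh(V0, E_ext, E_int, alpha=180.0):
    """Minimize E = E_ext + alpha * E_int over stacked vertex coordinates.

    V0           : (|V|, 3) current vertices
    E_ext, E_int : callables mapping a (|V|, 3) array to a scalar energy
    alpha        : trade-off weight (value from Sect. 3.2)
    """
    def E(v_flat):
        V = v_flat.reshape(-1, 3)
        return E_ext(V) + alpha * E_int(V)
    res = minimize(E, V0.ravel(), method='CG')   # gradients approximated numerically
    return res.x.reshape(-1, 3)
```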

3 Results

3.1 Image Database

The proposed method was evaluated on a publicly available database of 23 sagittally reconstructed T2-weighted turbo spin-echo 3D MR images of the lower spine from 23 subjects, acquired with a 1.5 Tesla Siemens scanner and resampled to a voxel size of \(2\,{\times }\,1.25\,{\times }\,1.25\,\text {mm}^3\). For each VB from T11 to L5 (\(7\,{\cdot }\,23\,{=}\,161\) in total), a reference manual segmentation was available [6]. For method evaluation, the database was split into 11 training and 12 testing images, and vice versa, according to a leave-half-images-out scheme that was repeated five times.

3.2 Implementation Details

Mean VB Shape Models. The marching cubes algorithm was first applied to each reference segmentation \(B_{ij}\) of the i-th VB (\(i\,{=}\,1,\,2,\,{\ldots },\,7\) for T11 to L5) in the j-th training image (\(j\,{=}\,1,\,2,\,{\ldots },\,M\) for \(M\,{=}\,11\) or 12 images). The resulting meshes \(\{\mathcal {M}_{ij}\}_{j=1}^{M}\) were aligned and made topologically corresponding by applying isotropic remeshing, coherent point drift and generalized Procrustes analysis, yielding the mean shape model \(\mathcal {M}_{i}\) of the i-th VB. Vertices of \(\mathcal {M}_{i}\) were further used for the deformable model-based segmentation guided by 3D spatial VB probability maps.
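Isotropic remeshing and coherent point drift require dedicated implementations, but the rigid alignment at the core of the generalized Procrustes step can be sketched compactly with the Kabsch algorithm; this is our illustration, not necessarily the authors' exact procedure:

```python
import numpy as np

def rigid_align(A, B):
    """Least-squares rigid alignment of point set A onto point set B.

    A, B : (N, 3) arrays of corresponding mesh vertices
    """
    a, b = A.mean(axis=0), B.mean(axis=0)
    H = (A - a).T @ (B - b)                      # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (A - a) @ R.T + b                     # rotated and translated copy of A
```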

CNN Architecture. For each i-th VB in the 3D testing image, its mean shape model \(\mathcal {M}_i\) was first initialized by coarsely identifying the corresponding VB centroid, and the minimum bounding box of \(\mathcal {M}_i\) was enlarged by a factor of two to obtain the VOI of the observed VB. For each VOI voxel, a 3D patch \({\varvec{P}}\) of \(15\,{\times }\,15\,{\times }\,15\) mm\(^3\) was extracted, isotropically resampled to a grid of \(15\,{\times }\,15\,{\times }\,15\) voxels, and fed to a 3D CNN with two convolutional (\(\mathcal {C}_1\) and \(\mathcal {C}_2\)), two pooling (\(\mathcal {P}_1\) and \(\mathcal {P}_2\)) and two fully connected layers (\(\mathcal {F}_3\) and \(\mathcal {F}_4\)). For CNN training, the same number of patches was extracted within and outside VBs (100k patches per training image). By using 48 kernels of size \(4\,{\times }\,4\,{\times }\,4\) with 3120 weights \({\varvec{W}}_{\!1}\) and biases \({\varvec{b}}_1\), layer \(\mathcal {C}_1\) generated 48 ordinary maps that were activated by \(f_1(x)\,{=}\,\max \{x,\,\frac{x}{10}\}\) and then transformed in layer \(\mathcal {P}_1\) into pooled maps by replacing each \(2\,{\times }\,2\,{\times }\,2\) block with its maximum, i.e. \(g_1\,{=}\,\max \). By convolving the pooled maps with 48 kernels of size \(3\,{\times }\,3\,{\times }\,3\) with 62256 weights \({\varvec{W}}_{\!2}\) and biases \({\varvec{b}}_2\), layer \(\mathcal {C}_2\) generated 48 ordinary maps that were activated by \(f_2\,{=}\,f_1\) and then transformed in layer \(\mathcal {P}_2\) into pooled maps in the same way as in \(\mathcal {P}_1\). Layer \(\mathcal {F}_3\) consisted of 192 linear combinations from \(\mathcal {P}_2\) with 73728 weights \({\varvec{W}}_{\!3}\) and biases \({\varvec{b}}_3\), activated by \(h_3(x)\,{=}\,\max \{x,0\}\), while layer \(\mathcal {F}_4\) consisted of two linear combinations from \(\mathcal {F}_3\) with 386 weights \({\varvec{W}}_{\!4}\) and biases \({\varvec{b}}_4\), activated by the identity \(h_4(x)\,{=}\,x\). The resulting vector \({\varvec{w}}_4\,{=}\,(w_{4,1},\,w_{4,2})^T\) was passed through the softmax function to obtain vector \({\varvec{p}}\,{=}\,(p_1,\,p_2)^T\), where \(p_1\) and \(p_2\) represented the probabilities that the observed VOI voxel belongs to the VB or to the background, respectively. Vector \({\varvec{p}}\) was obtained for every VOI voxel, resulting in the 3D spatial VB probability map for the observed VOI.

The optimal weights \({\varvec{W\!}}\) and biases \({\varvec{b}}\) of the CNN were obtained with stochastic gradient descent by minimizing the \(L^2\)-regularized negative log-likelihood cost function \(l(\mathcal {T},\,\{{\varvec{W\!}},{\varvec{b}}\})\,{=}\,-\sum _{{\varvec{P}} \in \mathcal {T}} \log P(X\,{=}\,c\,{\mid }\,{\varvec{P}},\,\{{\varvec{W\!}},{\varvec{b}}\}) + \lambda \left\| \{{\varvec{W\!}},{\varvec{b}}\}\right\| _2^2\), which explicitly counteracts overfitting, where \(P(X=c)\) is the probability that random variable X belongs to class c (VB or background), \(\lambda \,{=}\,0.001\) is the regularization parameter, and \(\mathcal {T}\) is the set of 3D patches with known classes c extracted from the training images. During training, the CNN classification accuracy converged to around \(97.8\,\%\) after 5 out of 20 epochs, and then fluctuated by only \(0.3\,\%\).
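For reference, the described architecture can be reproduced almost line for line in a modern framework; the PyTorch sketch below matches the stated layer sizes and parameter counts, while the learning rate is our assumption, since it is not reported:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 48, kernel_size=4),    # C1: 48*1*4^3 + 48 = 3120 parameters
    nn.LeakyReLU(0.1),                  # f1(x) = max(x, x/10)
    nn.MaxPool3d(2),                    # P1: 2x2x2 maximum, 12 -> 6 voxels per side
    nn.Conv3d(48, 48, kernel_size=3),   # C2: 48*48*3^3 + 48 = 62256 parameters
    nn.LeakyReLU(0.1),                  # f2 = f1
    nn.MaxPool3d(2),                    # P2: 4 -> 2 voxels per side
    nn.Flatten(),                       # 48 * 2^3 = 384 features
    nn.Linear(384, 192),                # F3: 73728 weights + 192 biases
    nn.ReLU(),                          # h3(x) = max(x, 0)
    nn.Linear(192, 2),                  # F4: 384 weights + 2 biases, identity h4
)

x = torch.randn(8, 1, 15, 15, 15)                  # a batch of 15x15x15 patches
p = torch.softmax(model(x), dim=1)                 # Eq. (4): (p1, p2) per patch
print(sum(t.numel() for t in model.parameters()))  # 139682 trainable parameters

# Training sketch: SGD with weight decay playing the role of the L2 term;
# CrossEntropyLoss combines log-softmax with the negative log-likelihood.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)
loss_fn = nn.CrossEntropyLoss()
```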

Table 1. Segmentation results for 161 vertebral bodies (levels T11–L5) from 23 MR spine images, presented as the mean (± standard deviation) of the Dice similarity coefficient (DSC), mean symmetric surface distance (MSSD) and Hausdorff distance (HD), and compared to the results of Chu et al. [6] obtained on the same image database.

Segmentation with Deformable Models. For each i-th VB in the 3D testing image, the mean shape model \(\mathcal {M}_i\) was used for the deformable model-based segmentation guided by the generated 3D spatial VB probability map, with experimentally defined parameters \(\alpha \,{=}\,180\), \(J\,{=}\,35\), \(D\,{=}\,5.0\) mm\(^{-2}\), \(\delta \,{=}\,0.25\) mm and \(K\,{=}\,15\) iterations. The conjugate gradient method was applied to obtain the coefficients \({\varvec{w}}^*\) and to minimize the energy E.

3.3 Experimental Results

The obtained segmentations were compared to the reference segmentations (Table 1), resulting in (\(\varDelta \) indicates the mean difference across the five leave-half-images-out experiments) an overall Dice similarity coefficient of \(93.4\,{\pm }\,1.7\,\%\) (\(\varDelta \,{=}\,0.5\,\%\)), a mean symmetric surface distance of \(0.54\,{\pm }\,0.14\,\text {mm}\) (\(\varDelta \,{=}\,0.13\,\text {mm}\)) and a Hausdorff distance of \(3.83\,{\pm }\,1.04\,\text {mm}\) (\(\varDelta \,{=}\,0.53\,\text {mm}\)). Figures 1(b) and 2(a) show examples of VB segmentations, and Fig. 2(b) shows their dependence on initialization.

Fig. 2. (a) Segmentations of vertebral bodies (VB) for a selected MR image, shown in the mid-sagittal and mid-coronal cross-sections, and in 3D. (b) Segmentation results after displacing the initialization of mean VB shape models from reference VB centroids (DSC – Dice similarity coefficient, MSSD – mean symmetric surface distance).

4 Discussion and Conclusion

By demonstrating that heuristically crafted features can be efficiently replaced by features learned directly from image appearance, CNNs have proved successful in image classification [8] and have therefore also been applied to medical image segmentation [12–14]. In this paper, we presented an automated method for supervised segmentation of VBs from 3D MR spine images based on coupling 3D CNNs with deformable models. By comparing the results to those of existing methods, in particular to those of Chu et al. [6] obtained on the same image database, the proposed method proved superior in terms of segmentation accuracy and robustness (Table 1). Moreover, as shown in Fig. 2(b), segmentations fully converged when initialized within a \(7.5\,\text {mm}\) radius of the reference VB centroids, meaning that the results were not biased by the coarse manual initialization, which could eventually be replaced by automated VB localization [6]. Besides yielding superior results, the proposed method provides additional advantages related to the use of CNNs for segmentation. Existing CNN-based 3D segmentation approaches [12–14] apply two-dimensional (2D) CNNs, whose input is a 2D image or patch, to each individual cross-section of the 3D image, and the results are often over- or under-segmented because they are obtained by thresholding the CNN probability maps combined with morphological processing and/or removal of connected components. In contrast, the main novelty of the proposed method is that it relies on 3D spatial VB probability maps, which are obtained by the designed 3D CNN architecture and guide the shape-constrained deformable models towards highly accurate and robust VB segmentations.