1 Introduction

The intervertebral disc (IVD) is a cartilaginous joint that lies between adjacent vertebrae. It plays a crucial role in absorbing shock during vertebral movement [1, 2]. In modern society, back pain is becoming a common health problem that causes pain, stiffness, and loss of independence. According to international studies, the point prevalence of back pain is between 12% and 35%, while the lifetime prevalence ranges from 49% to 80% [3]. Degeneration of the intervertebral disc is considered a major cause of this condition [4].

Magnetic Resonance Imaging (MRI) is a commonly used imaging technique for diagnosing IVD degeneration and many other diseases, and it provides non-invasive assessment of the human body. Compared with other medical imaging methods, such as Computed Tomography (CT), MRI provides excellent soft-tissue contrast without ionizing radiation. In addition, MR scans can be acquired with different modalities, which provide additional information about tissue structure. In this work, four MRI modalities (i.e. in-phase, opposed-phase, water, fat) were used for the segmentation and localization of IVDs. Figure 1 shows an example of these four modalities. Note that only the 7 IVDs between the twelfth thoracic vertebra and the sacrum are delineated manually as the targets.

Fig. 1. An example of the 3D multi-modality images provided by the MICCAI 2018 Challenge on IVDM3Seg. (a) to (d) are the in-phase, opposed-phase, water, and fat modalities, respectively, and (e) shows the manually delineated labels for the 7 IVDs between the twelfth thoracic vertebra and the sacrum.

Research on IVD degeneration usually requires segmentation of the IVDs. Traditionally, the IVD labels are delineated manually. However, this task is time-consuming and may be biased by inter- and intra-observer variability [5, 6]. Automatic IVD segmentation and localization methods are therefore of great significance to the study of IVD degeneration.

There are three main challenges for automatic IVD segmentation and localization on multi-modality images. Firstly, distinguishing different IVDs is difficult because the IVDs within a subject resemble one another. Secondly, the intensity at the IVD boundary resembles that of the neighboring tissues, which makes the IVD contour fuzzy. Thirdly, how to exploit multi-modality information effectively in medical image processing remains to be explored.

1.1 Previous Work

Many segmentation and localization methods proposed in previous research are based on traditional hand-crafted features [7,8,9,10,11,12,13]. Popular methods such as graph cut [10] and statistical shape models [7] have also been applied to IVD segmentation. For localization, graphical models have been proposed that take the geometric relationship between IVDs into account [13]. By referring to the shape of local parts and to neighboring anatomical structures, these methods improved the accuracy of IVD localization to some degree.

In recent years, machine learning has drawn extensive attention in many fields. Classical machine learning algorithms, such as marginal space learning (MSL) [14], AdaBoost [15], and sparse kernel machines [16], have also been applied to IVD segmentation and localization, and have shown excellent performance.

More recently, deep learning techniques have achieved great success in computer vision, and many researchers have begun to apply deep learning algorithms to medical image processing, where they have proven effective. In the past few years, all the state-of-the-art methods in the MICCAI IVD segmentation and localization challenges were based on deep learning [17, 18].

Multi-modality images are not unique to IVD segmentation and localization. How to utilize multi-modality information is a common issue in medical image processing, for example in MRI-based brain tissue segmentation [19] and brain tumor segmentation [20]. In general, exploiting multi-modality data improves performance to some extent.

1.2 Our Contribution

We propose a 2.5D multi-scale deep learning network for segmentation and localization of IVDs on multi-modality MR scans. Our method achieved state-of-the-art performance in the MICCAI 2018 Challenge on IVDM3Seg.

Our main contributions are summarized below:

  1. We propose a multi-scale 2.5D fully convolutional network (FCN) for IVD segmentation and localization on multi-modality MR scans. The backbone of the proposed network is a U-Net-like architecture [21]. The input of the 2.5D network is a set of adjacent slices from the multi-modality MR scans, while the output is a 2D slice corresponding to a certain slice of the input. To take full advantage of the multi-modality information, Squeeze-and-Excitation (SE) modules [22] are added in the skip connections.

  2. We propose a model fusion strategy to improve the accuracy and robustness of IVD prediction. In this work, we trained three different 2.5D networks, whose predictions correspond to the middle, the rightmost, and the leftmost slices of the input sequence, respectively. For slices located in the middle of the 3D images along the Z-axis, the average of the outputs of these models is taken as the final prediction. For slices near either edge, the IVD prediction is generated by the model that corresponds to either the rightmost or the leftmost slice of the input sequence.

  3. We propose a geometric constraint post-processing method to generate accurate IVD localization results. This method takes the intra-subject geometric relationship of the IVDs into account. In our experiments, the false positive regions in the prediction maps are effectively eliminated by this method.

2 Methodology

This section details the IVD segmentation and localization method. We start by illustrating the architecture of the proposed 2.5D multi-scale FCN for IVD segmentation, and then explain how multi-modality images are exploited with this network. To improve the robustness and accuracy of prediction, an ensemble strategy is employed. To correct false positive regions in the prediction maps, we propose a post-processing pipeline that takes the geometric constraints of the 7 specified IVDs into account. The final segmentation and localization results are generated by this post-processing method.

2.1 2.5D Multi-scale FCN for IVD Segmentation

The detailed structure of the proposed network is shown in Fig. 2. The backbone of this network is a U-Net-like architecture; U-Net has achieved great success in medical image processing since it was proposed in 2015. To utilize multi-modality images, the U-Net architecture is slightly adapted from the original version. The input of this network is expanded to 44 channels (11 slices × 4 modalities) to exploit the multi-modality data, while the output corresponds to a certain position in the input sequence. In addition, residual connections are added between feature maps of the same scale, and SE modules are inserted in the skip connections between the contracting path and the expansive path. The reduction ratio used in the SE modules is set to 16.

Fig. 2. The details of our proposed 2.5D multi-scale FCN for IVD segmentation. The input sequence includes 44 slices, consisting of 11 consecutive slices at the same positions in each of the four modalities. The prediction map corresponds to the middle, the leftmost, or the rightmost slice of the input sequence.
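The channel recalibration performed by an SE module can be illustrated with a minimal NumPy sketch. It only shows the squeeze (global average pooling) and excitation (two fully connected layers with reduction ratio 16) steps applied to one feature map; the channel count of 64, the random weights, and the omission of bias terms are assumptions made for illustration and are not taken from the proposed network.

```python
import numpy as np

def se_block(features, w_reduce, w_expand):
    """Apply squeeze-and-excitation to a feature map shaped (C, H, W)."""
    squeezed = features.mean(axis=(1, 2))               # squeeze: global average pool -> (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)       # FC + ReLU -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))  # FC + sigmoid -> (C,)
    return features * gates[:, None, None], gates       # rescale each channel by its gate

rng = np.random.default_rng(0)
channels, reduction = 64, 16   # 64 channels is a hypothetical value; 16 is the ratio from the text
features = rng.standard_normal((channels, 32, 32))
w_reduce = rng.standard_normal((channels // reduction, channels)) * 0.1
w_expand = rng.standard_normal((channels, channels // reduction)) * 0.1

recalibrated, gates = se_block(features, w_reduce, w_expand)
print(recalibrated.shape, gates.shape)
```

Each gate lies strictly between 0 and 1, so the module can only attenuate or preserve a channel, which is how informative channels are emphasized relative to the others.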

2.2 2.5D Multi-scale FCN Ensemble Strategy

All the multi-modality images used in this work have the same size of 256 × 256 × 36. For each study, 11 consecutive slices at the same positions in the four modalities are extracted and concatenated as an input sequence, which gives 26 such sequences per image (36 - 11 + 1 = 26). These input sequences are used to train three 2.5D multi-scale FCNs. The predictions of these models correspond to different slices of the input sequence, namely the middle, the leftmost, and the rightmost slice. We use \( m_{middle} \), \( m_{left} \) and \( m_{right} \) to denote these three models in the following. The ensemble of their outputs is taken as the prediction result, which is more accurate and robust than the output of a single model. For simplicity of description, a mono-modality 3D image \( V \) is taken as an example. Slices in \( V \) from left to right are denoted as \( {\text{S}}_{i} \left( {i \in \left\{ {1, 2, \ldots , 36} \right\}} \right) \). For \( {\text{S}}_{6} \) to \( {\text{S}}_{31} \), the average of the outputs of \( m_{middle} \), \( m_{left} \) and \( m_{right} \) is taken as the prediction of IVD segmentation. For \( {\text{S}}_{1} \) to \( {\text{S}}_{5} \) and \( {\text{S}}_{32} \) to \( {\text{S}}_{36} \), the prediction of IVD segmentation is generated by \( m_{left} \) and \( m_{right} \), respectively.
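The input construction and the slice-to-model assignment described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the channel order (all slices of one modality, then the next) is an assumption, and the sketch only encodes which models are combined for each slice. How each model produces a prediction near the volume borders is not specified in the text and is not modeled here.

```python
import numpy as np

N_SLICES, WINDOW = 36, 11
HALF = WINDOW // 2  # 5 slices on each side of the middle slice

def build_windows(volumes):
    """Stack every run of WINDOW consecutive slices from all modalities.

    volumes: list of arrays shaped (H, W, N_SLICES), one per modality.
    Returns an array shaped (n_windows, WINDOW * n_modalities, H, W).
    """
    n_windows = volumes[0].shape[-1] - WINDOW + 1
    windows = []
    for start in range(n_windows):
        # Move the slice axis to the front so that slices become channels.
        chans = [np.moveaxis(v[..., start:start + WINDOW], -1, 0) for v in volumes]
        windows.append(np.concatenate(chans, axis=0))
    return np.stack(windows)

def models_for_slice(i):
    """Map a 1-based slice index to the models whose outputs are averaged."""
    if i <= HALF:                 # S_1 to S_5
        return ["left"]
    if i > N_SLICES - HALF:       # S_32 to S_36
        return ["right"]
    return ["middle", "left", "right"]  # S_6 to S_31

# A tiny 8 x 8 in-plane size keeps the example light; the real images are 256 x 256.
volumes = [np.random.rand(8, 8, N_SLICES) for _ in range(4)]
windows = build_windows(volumes)
print(windows.shape)  # 26 windows of 44 channels each
print(models_for_slice(3), models_for_slice(20), models_for_slice(34))
```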

2.3 Geometric Constraint Post-processing

Although the model ensemble improves the accuracy and robustness of the segmentation results to a certain extent, there are still some obvious false positive regions in the prediction maps. These false positive regions fall into two types: isolated noise points, and segmented IVDs above the twelfth thoracic vertebra. Figure 3 visualizes some ensemble prediction maps on the opposed-phase modality. The isolated noise can be eliminated by excluding the small connected regions in the prediction maps. To remove the IVDs above the twelfth thoracic vertebra, we propose a post-processing method with a geometric constraint. Firstly, the ground truths from the training set are aligned to the segmentation result with reference to the centroid of the last IVD. These ground truths are then registered to the segmentation result with an affine transformation, and the best-fitting one is selected as the mask. All connected regions that have no intersection with this mask are removed, and the remaining regions form the final prediction of the 7 expected IVDs. For robustness, the registered ground truth is dilated before being applied as the mask (Fig. 4).

Fig. 3. Examples of prediction maps on the opposed-phase modality without post-processing. (a) to (f) are 6 slices extracted from a study. Green contours indicate the boundary of the ground truth, and red contours delineate the ensemble prediction of the IVDs. (Color figure online)

Fig. 4. Illustration of the geometric constraint post-processing. (a) is the ensemble prediction of the proposed networks. The red mask in (b) is the selected registered ground truth after binary dilation. (c) is the final IVD segmentation result of our method. (Color figure online)
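The two filtering steps of the post-processing (dropping small connected regions, then dropping regions that do not intersect the mask) can be sketched on a toy 2D example. This is a dependency-free illustration under several assumptions: the mask is taken as already registered and dilated, 4-connectivity and the `min_size` threshold are hypothetical choices, and the real data are 3D. The registration step itself is not shown.

```python
from collections import deque

def connected_components(binary):
    """Return the 4-connected foreground components as sets of (row, col)."""
    rows, cols = len(binary), len(binary[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or (r, c) in seen:
                continue
            queue, comp = deque([(r, c)]), set()
            seen.add((r, c))
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                comp.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(comp)
    return components

def filter_prediction(pred, mask, min_size=2):
    """Keep components with at least min_size pixels that overlap the mask."""
    mask_pixels = {(r, c) for r, row in enumerate(mask)
                   for c, v in enumerate(row) if v}
    out = [[0] * len(pred[0]) for _ in pred]
    for comp in connected_components(pred):
        if len(comp) >= min_size and comp & mask_pixels:
            for r, c in comp:
                out[r][c] = 1
    return out

pred = [
    [1, 0, 0, 0, 0, 0],  # isolated noise pixel
    [0, 0, 0, 1, 1, 0],  # false positive region outside the expected discs
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],  # expected disc
    [0, 1, 1, 0, 0, 0],
]
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],  # registered and dilated ground truth (assumed given)
    [0, 0, 1, 1, 0, 0],
]
cleaned = filter_prediction(pred, mask)
for row in cleaned:
    print(row)
```

Only the region in the last two rows survives: the single pixel fails the size test, and the two-pixel region has no overlap with the mask.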

3 Experiments and Results

3.1 Data

The performance of our method was evaluated on the multi-modality MR scans provided by the MICCAI 2018 Challenge on IVDM3Seg. These data were collected from 8 subjects at two time points of a prolonged bed rest study. For each study, four MR scans acquired with different modalities (i.e. in-phase, opposed-phase, water, fat) were included. The IVDs between the twelfth thoracic vertebra and the sacrum were delineated manually as the ground truth. Figure 1 shows an example of these multi-modality images and the corresponding ground truth.

3.2 Pre-processing and Data Augmentation

The multi-modality images were pre-processed with commonly used methods. Firstly, the N4 correction algorithm was applied to correct the bias field of the MR scans. Next, the intensity distribution of the corrected images was normalized to zero mean and unit variance. Because the training data are limited, data augmentation methods (i.e. random scaling, rotation, translation, and deformable transformation) were applied during the training stage.
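The intensity normalization step can be sketched as below. The text does not state whether the statistics are computed per volume, per modality, or within a foreground region, so this sketch assumes per-volume statistics; the synthetic volume and the `eps` guard are illustrative. The N4 bias field correction is not shown.

```python
import numpy as np

def zscore(volume, eps=1e-8):
    """Normalize a volume to zero mean and unit variance (per-volume statistics assumed)."""
    return (volume - volume.mean()) / (volume.std() + eps)

rng = np.random.default_rng(0)
volume = rng.normal(300.0, 50.0, size=(16, 16, 4))  # synthetic stand-in for one MR scan
normalized = zscore(volume)
print(normalized.mean(), normalized.std())
```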

3.3 Evaluation Metrics

The segmentation and localization results are evaluated with the following three quantitative metrics:

  1. Dice overlap coefficient. The Dice metric is one of the most popular measures for semantic segmentation. It quantifies the overlap between the prediction and the ground truth, and is defined by the following formula:

    $$ Dice = \frac{2\left| A \cap B \right|}{\left| A \right| + \left| B \right|} \times 100\% $$
    (1)

    where A is the set of foreground voxels in the ground truth and B is the set of foreground voxels in the prediction.

  2. Average absolute distance (ASD). For the IVD segmentation task, ASD is the average absolute distance between the disc surface of the ground truth and that of the segmentation result. A smaller ASD means a better segmentation result.

  3. Localization distance. This metric is used to measure the localization results. It is calculated by the equation below:

    $$ R = \sqrt {\left( {\Delta x} \right)^{2} + \left( {\Delta y} \right)^{2} + \left( {\Delta z} \right)^{2} } $$
    (2)

    where \( \Delta x \), \( \Delta y \) and \( \Delta z \) are the absolute distances between the identified IVD centroid and the corresponding ground truth centroid along the X-, Y- and Z-axes. A smaller localization distance means a more accurate localization.
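Equations (1) and (2) can be checked on a small example with the sketch below. The function names are hypothetical, returning 100% for two empty masks is a convention chosen here rather than one stated in the text, and the centroids are assumed to be given in the same physical units. ASD is omitted because it requires surface extraction.

```python
import math
import numpy as np

def dice_percent(gt, pred):
    """Dice overlap of Eq. (1), in percent, for two binary masks."""
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    denom = gt.sum() + pred.sum()
    if denom == 0:
        return 100.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 200.0 * np.logical_and(gt, pred).sum() / denom

def localization_distance(centroid_pred, centroid_gt):
    """Euclidean distance of Eq. (2) between two 3D centroids."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(centroid_pred, centroid_gt)))

gt = np.zeros((4, 4), dtype=int)
pred = np.zeros((4, 4), dtype=int)
gt[0:2, 0:2] = 1    # 4 foreground voxels
pred[0:2, 1:3] = 1  # 4 foreground voxels, 2 of them shared with gt

print(dice_percent(gt, pred))                            # 2 * 2 / (4 + 4) = 50%
print(localization_distance((1, 2, 2), (0, 0, 0)))       # sqrt(1 + 4 + 4) = 3
```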

3.4 Results of MICCAI 2018 On-site Challenge

Tables 1, 2, and 3 list the on-site test results of the proposed method in the MICCAI 2018 Challenge on IVDM3Seg. Our method achieved state-of-the-art performance with respect to all three quantitative metrics (i.e. Dice, ASD, and localization distance) among the nine participating teams.

Table 1. Dice of on-site test results
Table 2. ASD of on-site test results
Table 3. Localization distance of on-site test results

4 Discussion

Some common spine diseases, such as low back pain (LBP), have been shown to be associated with IVD degeneration [23]. IVD segmentation and localization are therefore of considerable significance in clinical diagnosis and research. In this work, we proposed an automatic IVD segmentation and localization method for multi-modality MRI based on a 2.5D multi-scale FCN and geometric constraint post-processing.

In the MICCAI 2018 Challenge on IVDM3Seg, deep neural networks were the most popular approach. For 3D multi-modality MR images, processing with a 3D network is a straightforward choice. Compared with 2D networks, 3D architectures can generate more discriminative spatial features, and they were employed by some teams in this challenge. However, because deep neural networks have a large number of parameters, a huge amount of data is required in the training stage, and the MICCAI 2018 Challenge on IVDM3Seg provided only 16 studies, collected from 8 subjects at two time points. Considering the scarcity of 3D multi-modality images, we proposed a 2.5D multi-scale FCN architecture as a tradeoff between the capacity of the network and the amount of training data. The on-site test results of the MICCAI 2018 Challenge on IVDM3Seg show that, with limited training data, 2D networks generally performed better than 3D networks, and that our 2.5D FCN surpassed both 2D and 3D architectures.

The intra-subject morphological and topological relationships between IVDs are similar across subjects, and can potentially be utilized for IVD localization. However, these relationships are hard for an FCN to capture. To take this information into account, we proposed a geometric constraint post-processing method based on registration, which performed well in the on-site test of the MICCAI 2018 Challenge on IVDM3Seg. Note that our registration-based post-processing relies on the inter-subject consistency of the intra-subject geometric relationship between IVDs. If this consistency is disrupted by a severe spine disease, the method may produce incorrect results. A more robust IVD localization method remains to be explored in future work.