1 Introduction

Accurate and robust segmentation of the corpus callosum (CC) in brain magnetic resonance imaging (MRI) is a crucial step in many automatic image analysis tasks, with clinical applications such as computer-aided diagnosis and prognosis of several neurological diseases [1, 6, 10, 16].

The segmentation task is often challenging, however, especially in clinical studies of chronic diseases such as multiple sclerosis, where brain structures may have undergone significant changes over time, rendering registrations of subject images to pre-segmented brain templates inaccurate [12, 18]. Consequently, the standard approach of employing (multi-)atlas-based segmentation algorithms may fail, as reported in recent publications [15, 18]. Furthermore, the MRI scans to be segmented often vary significantly in image quality, which renders many unsupervised segmentation algorithms, with their delicate image-dependent parameter tuning, impractical when a large dataset of images must be segmented.

As demonstrated in [18], a multi-atlas-based algorithm custom-designed for segmentation of the corpus callosum achieved a segmentation accuracy of 0.77 in Dice similarity score (DSC) on a challenging clinical dataset, while generic multi-atlas-based methods, such as weighted majority voting and STEPS [8], achieved average DSCs of 0.65 to 0.70.

In view of the availability of several public image datasets [12, 14], some of which are accompanied by expert segmentations, we hypothesize that adopting a supervised approach [5, 7] to the segmentation task may provide a better alternative to existing unsupervised methods.

More specifically, we propose a two-stage process to improve both the efficiency of model training and the accuracy of the final segmentation. In the first stage, we employ multi-atlas-based segmentation to localize the CC structure. Because the neighbouring structures of the CC provide adequate contextual information, this localization step is tolerant to imaging artifacts and to shape variations caused by significant brain atrophy, and is sufficient for CC localization, which requires less precision than the actual segmentation task. In the second stage, we adopt a discriminative learning approach in which we employ a subset of the available expert segmentations from a single dataset to train a neural network architecture called a convolutional encoder network (CEN) [7], which performs feature extraction and model-parameter training jointly.

Our two-stage pipeline is motivated as follows. Rather than training a CEN that extracts features from the entire image slice, we propose to examine image features only within the bounding box of the target structure. By first performing target localization to identify the region of interest (ROI), our two-stage approach explicitly constrains the network to learn and extract features only around the periphery of the CC, yielding features that are more relevant to differentiating the CC from its neighbouring vessels, which, as reported in [18], often confuse existing unsupervised segmentation algorithms.

To this end, our main contributions are as follows. We (1) propose a novel pipeline for fully automatic CC segmentation; (2) perform a comparative analysis with existing unsupervised methods; and (3) conduct an extensive evaluation of our method using 4 datasets comprising over 2,000 scans with different image characteristics.

2 Methods

2.1 ROI Localization

In localizing the CC structure of each image, we employ a multi-atlas-based segmentation approach; this step is performed both at training and at test time. Our localization process registers 5 segmented atlases from the public dataset of Heckemann et al. [12] to the target image, which allows us to transfer the CC segmentation label of each atlas onto the image to be segmented. After registration, we perform weighted majority voting on the transferred atlas labels and threshold the cast votes to define a rough outline of the CC in each image. To obtain a confident estimate of the CC region, we employ a relatively high threshold value of 0.7 and set the weight of each atlas's vote to be proportional to the average local correlation between the registered atlas and the subject image being segmented.
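
To make the fusion rule concrete, the following is a minimal sketch of weighted majority voting with correlation-based weights and the 0.7 threshold; it is our illustration rather than the exact implementation, and the correlation window radius and weight normalization are our assumptions.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def mean_local_correlation(a, b, radius=2):
        # Windowed Pearson correlation, averaged over the image;
        # the window radius is an illustrative choice.
        size = 2 * radius + 1
        mu_a, mu_b = uniform_filter(a, size), uniform_filter(b, size)
        cov = uniform_filter(a * b, size) - mu_a * mu_b
        var_a = uniform_filter(a * a, size) - mu_a ** 2
        var_b = uniform_filter(b * b, size) - mu_b ** 2
        cc = cov / np.sqrt(np.clip(var_a * var_b, 1e-12, None))
        return float(cc.mean())

    def localize_cc(subject, atlas_images, atlas_labels, threshold=0.7):
        # Weight each atlas vote by its average local correlation with the
        # subject, then threshold the weighted votes (0.7, as in the text).
        w = np.array([mean_local_correlation(subject, a) for a in atlas_images])
        w = np.clip(w, 0.0, None)
        w /= w.sum()
        votes = sum(wi * lab.astype(float) for wi, lab in zip(w, atlas_labels))
        return votes >= threshold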

For registration, we employ a coarse-to-fine, multi-stage (rigid to affine) registration pipeline as provided by the ANTS registration toolkit [2]. Initial experiments demonstrated that the localization results are insensitive to the choice of similarity measure, so we chose the mutual information measure for its low computational cost compared with the local cross-correlation measure. In calculating the similarity measure, we also employ stochastic sampling to further increase computational efficiency.
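
As an illustration only, the following Python wrapper sketches a rigid-then-affine antsRegistration call with a mutual-information metric and random (stochastic) sampling; the file names, convergence schedule, and sampling percentage are assumptions, not the exact configuration used in this work.

    import subprocess

    def register_atlas(subject_path, atlas_path, out_prefix):
        def stage(transform):
            return [
                "--metric", f"MI[{subject_path},{atlas_path},1,32,Random,0.25]",
                "--transform", f"{transform}[0.1]",
                "--convergence", "[1000x500x250,1e-6,10]",  # coarse-to-fine schedule
                "--shrink-factors", "4x2x1",
                "--smoothing-sigmas", "2x1x0vox",
            ]
        cmd = (["antsRegistration", "--dimensionality", "3",
                "--output", f"[{out_prefix}]"]
               + stage("Rigid") + stage("Affine"))
        subprocess.run(cmd, check=True)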

2.2 Joint Feature-Learning and Model Training

Let there be a training set of N images \(\mathcal{I}=\{I_n\}_{n=1}^{N}\) and corresponding CC segmentations \(\mathcal{S}=\{S_n\}_{n=1}^{N}\). Following the supervised approach of [7], our goal is to seek a segmentation function f that maps each input image \(I_n\) to its corresponding CC segmentation \(S_n\):

$$\begin{aligned} \hat{f} = \arg \min _{f \in \mathcal {F}} \sum _{n=1}^{N} E(S_n, f(I_n)), \end{aligned}$$
(1)

where \(\mathcal {F}\) is the set of possible segmentation functions, and E is an error function that measures the sum of squared differences between \(S_n\) and the segmentation predicted by f.
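
For concreteness, with pixel-wise predictions this error term can be written as (our notation, consistent with the sum-of-squared-differences description above)

$$\begin{aligned} E(S_n, f(I_n)) = \sum _{p} \left( S_n(p) - f(I_n)(p) \right) ^2, \end{aligned}$$

where p ranges over the pixels of the input slice.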

As in previous work [7], \(\mathcal {F}\) is modeled by a convolutional encoder network (CEN) with shortcut connections, whose architecture is divided into a convolutional pathway and a deconvolutional [5] pathway. The convolutional pathway is designed to automatically learn a set of features from \(\mathcal I\) and is constructed from alternating convolutional and pooling layers, where the former convolve their input signal with filter kernels and the latter average over blocks of the convolved signal. Conversely, the deconvolutional pathway is designed to reconstruct the segmentation mask from the outputs of the convolutional pathway and consists of alternating deconvolutional and unpooling layers. In short, the convolutional pathway learns a hierarchy of low-level to high-level image features, while the deconvolutional pathway learns to predict segmentations from those learned features.

As detailed in [3, 7], low-level image features increase localization precision, while high-level features provide the contextual information needed to guide segmentation. Hence, to integrate these hierarchical features, the two pathways of the CEN model [7] are linked by “shortcut connections”, through which the activations of the convolutional pathway are connected to those of the deconvolutional pathway [7].

2.3 Details on Implementation and Model-Training

We trained a 2-layer CEN in which each layer employs 32 filters, of size \(9 \times 9\) for the convolutional layers and \(2 \times 2\) for the pooling layers. For the hidden units, we employ an improved version of the rectified linear units [7]. We stopped training when the training error converged, usually around 800 epochs. To avoid overfitting [7], we also employ the dropout technique with dropout rates of 0.5 to 0.75.
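
The following PyTorch sketch re-expresses this 2-layer architecture under the settings stated above (32 filters of \(9 \times 9\), \(2 \times 2\) average pooling, one shortcut connection, dropout); it is our re-expression of the model of [7], not the authors' code, and the leaky rectifier and padding choices are stand-in assumptions.

    import torch
    import torch.nn as nn

    class CEN(nn.Module):
        def __init__(self, p_drop=0.5):
            super().__init__()
            # Convolutional pathway: feature extraction.
            self.conv1 = nn.Conv2d(1, 32, 9, padding=4)
            self.conv2 = nn.Conv2d(32, 32, 9, padding=4)
            self.pool = nn.AvgPool2d(2)
            # Deconvolutional pathway: segmentation reconstruction.
            self.unpool = nn.Upsample(scale_factor=2)
            self.deconv2 = nn.Conv2d(32, 32, 9, padding=4)
            self.deconv1 = nn.Conv2d(64, 1, 9, padding=4)  # 64 = 32 + 32 via shortcut
            self.drop = nn.Dropout2d(p_drop)
            self.act = nn.LeakyReLU(0.1)  # stand-in for the improved ReLU of [7]

        def forward(self, x):
            h1 = self.act(self.conv1(x))               # low-level features
            h2 = self.act(self.conv2(self.pool(h1)))   # high-level features
            u2 = self.act(self.deconv2(self.unpool(self.drop(h2))))
            u1 = torch.cat([u2, h1], dim=1)            # "shortcut connection"
            return torch.sigmoid(self.deconv1(u1))     # per-pixel CC probability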

Further, recent research [7, 17] has shown that pre-training of CEN model parameters, followed by adequate fine-tuning, improves accuracy across a variety of application domains. Accordingly, similar to [7], we performed layer-by-layer pre-training on the input images using 3 layers of convolutional restricted Boltzmann machines (convRBMs) [13]: the first layer is trained on the input images, and each subsequent layer is trained on the hidden activations of the previous layer. The model parameters of our CENs are then initialized with the weights obtained from this pre-training step.
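
For illustration, a compact sketch of one contrastive-divergence (CD-1) update for a convolutional RBM follows; it assumes binary units and omits the visible bias for brevity, and it is our simplification rather than the implementation of [13].

    import torch
    import torch.nn.functional as F

    def free_energy(v, W, c):
        # Free energy with binary hidden units (visible bias omitted for brevity).
        act = F.conv2d(v, W) + c.view(1, -1, 1, 1)
        return -F.softplus(act).sum(dim=(1, 2, 3))

    def cd1_step(v, W, c, opt):
        # Positive phase: sample hidden feature maps given the data.
        h = torch.bernoulli(torch.sigmoid(F.conv2d(v, W) + c.view(1, -1, 1, 1)))
        # Negative phase: mean-field reconstruction of the visible layer.
        v_recon = torch.sigmoid(F.conv_transpose2d(h, W)).detach()
        # The CD-1 gradient is d/dtheta [ F(data) - F(reconstruction) ].
        loss = (free_energy(v, W, c) - free_energy(v_recon, W, c)).mean()
        opt.zero_grad(); loss.backward(); opt.step()

Each subsequent convRBM layer would be trained in the same way on the hidden activations of the layer below.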

In training the convRBMs, we employ the AdaDelta algorithm and explored various weight initializations for the convRBM model parameters, i.e. zero-mean Gaussians with standard deviations of \(\left\{ 0.1, 0.03, 0.01, 0.003, 0.001\right\} \), chosen on a logarithmic scale given that the images were normalized to zero mean and unit standard deviation. We also explored different settings of the AdaDelta hyper-parameter \(\epsilon \in \left\{ 1\mathrm {e}{-8},1\mathrm {e}{-9},1\mathrm {e}{-10},1\mathrm {e}{-11}\right\} \) and optimized it empirically.
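
A minimal sketch of these grids, continuing the convRBM example above (the filter count and size follow Sect. 2.3; torch.optim.Adadelta exposes the \(\epsilon\) parameter tuned here):

    import torch

    for sigma in (0.1, 0.03, 0.01, 0.003, 0.001):   # weight-init std. dev. grid
        for eps in (1e-8, 1e-9, 1e-10, 1e-11):       # AdaDelta epsilon grid
            W = (sigma * torch.randn(32, 1, 9, 9)).requires_grad_()
            c = torch.zeros(32, requires_grad=True)
            opt = torch.optim.Adadelta([W, c], eps=eps)
            # ... run cd1_step(...) over the training images and keep
            # the best-performing setting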

3 Experiments and Results

Materials. The following T1-weighted (T1w) MR datasets were retrospectively collected for this work. Owing to this retrospective nature, two protocols, manual and semi-automatic, were used to generate the CC segmentations used for training and testing.

MS (in-house): 348 patients with secondary progressive multiple sclerosis (SPMS) were enrolled in a clinical trial. Each subject was scanned at 4 timepoints: screening, baseline (\(\sim \)4 weeks after screening), year one, and year two, resulting in a total of 280 + 270 + 280 + 300 = 1130 scans. Each scan was registered to the template space and the mid-sagittal slice was automatically extracted from each volume. The area of the CC was then manually outlined by a trained medical student who was blinded to the chronological order of the four timepoints.

CIS (in-house): 140 subjects with clinically isolated syndrome (a prodromal phase of MS) were enrolled in another clinical trial, completed in approximately 2013. Each subject was scanned at up to six timepoints: screening, baseline (\(\sim \)4 weeks after screening), 3 months, 6 months, year one, and year two. Registration and automatic extraction of the mid-sagittal slice from each volume were performed as above. The Yuki tool [6] was then applied to segment the CC from each mid-sagittal slice, and each segmentation was manually edited by a blinded operator using a graphical user interface.

Fig. 1. Randomly selected images from our four test datasets (rows from top to bottom): MS#1, CIS, OAS-NC, and OAS-AD.

OAS-NC (public): normal subjects from the OASIS dataset, downloaded from NITRC. The CC segmentations were obtained using the semi-automatic CC extraction method of [6], with 20% of the scans subsequently edited by a blinded operator using the ITK-SNAP software, as noted in [6].

OAS-AD (public): subjects with Alzheimer’s disease, also from the OASIS dataset and downloaded from NITRC. The CC segmentations were obtained using the same procedure as for OAS-NC.

Figure 1 shows randomly selected images from each dataset, while Table 1 summarizes the characteristics of these datasets. All scans were registered to the template space and resampled to 256 \(\times \) 256 with a pixel size of \(1 \times 1\) mm prior to analysis. Denoting the images acquired at the i-th timepoint of the MS dataset as MS#i, we chose the fourth-timepoint images MS#4 as the training set (N = 300).

Table 1. Characteristics of the datasets: size (number of scans), signal-to-noise ratio (\( \mathrm {SNR_{mean}}\) as given in [9]), original pixel spacing (mm), and image dimensions.

Experiment I. Comparative analysis with unsupervised methods. We first performed a comparative analysis against existing unsupervised methods: the intensity-based approach of Adamson et al. [1] (denoted CCTPA), weighted majority voting (WMV), STAPLE [4], STEPS [8], and the random-walker-based algorithm of [18]. Note that all multi-atlas-based methods employed the same region of interest that we used for our proposed two-stage pipeline; for CCTPA, whose parameters depend on the input size, we did not use the same ROI. Table 2 reports the Dice similarity score (DSC) obtained by each method, averaged over the entire MS dataset excluding the training set MS#4. Examining the numbers, our proposed supervised approach gave results superior to those of the unsupervised methods.
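
For reference, the DSC used throughout the evaluation is the standard overlap measure; a minimal implementation for binary masks:

    import numpy as np

    def dsc(pred, truth):
        # DSC = 2 |A intersect B| / (|A| + |B|) for binary masks A and B.
        pred, truth = pred.astype(bool), truth.astype(bool)
        return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum())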

We also examined the effect of training the CEN directly on the entire mid-sagittal slices, bypassing the CC localization step. From the same table, we can see that our two-stage, ROI-specific CEN approach gave the highest DSC, confirming the usefulness of the proposed two-stage pipeline.

Table 2. DSC obtained by existing unsupervised methods and our supervised approach, averaged over all available test data. “Ours” denotes the proposed two-stage pipeline, which learns and extracts features only within the bounding box of the CC structure localized in stage 1. “CEN” denotes applying the approach of [7] to the entire image slice.

Experiment II. Sensitivity analysis of the training sample size. Table 3 highlights the effect of the training sample size (N); its top three rows report the segmentation accuracy obtained by our method on each dataset not used for training. Evidently, the more training samples used, the higher the accuracy obtained on all four datasets.

We again examined the effect of omitting the proposed two-stage pipeline by training the CEN directly on the entire image slices. Examining the bottom row of Table 3, we observe that our proposed two-stage approach gave higher segmentation accuracy on all datasets except MS#1. Hence, the proposed two-stage pipeline is not only more efficient (a smaller image domain translates to fewer degrees of freedom needed to model f), but is also capable of achieving better results.

Figure 2 shows example segmentation results obtained by thresholding the probabilistic predictions inferred from the trained CEN, i.e. with no other post-processing. Because we adopt a voxel-wise classification approach and the CENs we employ do not explicitly enforce any form of shape or spatial regularization, some disconnected voxels can be mislabeled as CC, as highlighted in red in the figure. The produced segmentations could be refined by fairly simple processing steps, such as connected-component analysis, but this was deliberately not done in this work so as not to confound the evaluation of the ROI-specific CEN approach; a sketch of such a step is given below.
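
A minimal sketch of this refinement (not applied in our experiments; the 0.5 threshold is an illustrative assumption): keep only the largest connected component of the thresholded prediction.

    import numpy as np
    from scipy import ndimage

    def largest_component(prob, threshold=0.5):
        # Threshold the CEN's probabilistic output, then discard all but
        # the largest connected component to remove isolated false positives.
        mask = prob >= threshold
        labels, n = ndimage.label(mask)
        if n == 0:
            return mask
        sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
        return labels == (1 + int(np.argmax(sizes)))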

Table 3. Effect of using N training images randomly drawn from MS#4. Coloured cells indicate the top performance achieved for each dataset; asterisks denote significant improvement over the second-best method (p < 0.001).
Fig. 2. Segmentation results randomly drawn from each of the four datasets (rows from top to bottom: MS#1, CIS, OAS-NC, OAS-AD). True positive, false positive, and false negative pixels are shown in green, red, and yellow, respectively.

4 Conclusions

We proposed a two-stage, fully automatic method for segmenting the CC structure in brain MRI and presented a thorough validation using four datasets collected from public and in-house sources. Our method was shown to produce results superior to those generated by existing unsupervised methods. Furthermore, our proposed two-stage pipeline, which trains an ROI-specific CEN, achieved accuracy higher than a non-ROI-specific CEN on most of the datasets tested. Future work involves implementing post-processing steps, e.g. connected-component analysis, to minimize the number of false-positive predictions and thereby further improve the segmentation accuracy of our proposed pipeline.