1 Introduction

Since Krizhevsky et al. [16] won the ImageNet competition [6] in 2012, deep convolutional neural networks [17] have been the mainstream way to advance the state of the art in computer vision and pattern recognition, where a series of breakthroughs has been achieved across a wide range of tasks, for example classification [11, 26, 27], object detection [7, 21], and segmentation [23]. This series of successes is credited to the ability of CNNs to learn from raw input without manual feature engineering.

Benefiting from large open medical-image datasets, computer-aided detection (CADe) based on deep learning has become a reality. CADe has been used in clinical environments for over 40 years [1], but it usually cannot replace the doctor or take the leading role in diagnosis. However, the ResNet designed by He et al. [11] achieves a 3.57% top-5 error on the ImageNet test set, surpassing human performance, which raises the possibility that machines can substitute for doctors in some tasks, for instance detecting cancer metastases [22] and diabetic retinopathy [9].

In this paper, we specifically consider the problem of automatic nerve segmentation: given an ultrasound image of the neck, which may or may not contain a nerve, we want to segment the nerve fully automatically, end to end. Recently, approaches to medical image segmentation that rely on deep convolutional neural networks have achieved great success, in view of which we tested some foundational CNN-based methods and developed our own architecture.

The proposed architecture is inspired by U-Net [25], which performs well in biomedical image segmentation but has too many parameters. To improve performance, we adopted inception modules and batch normalization. Through these efforts, the model has fewer parameters and requires less training time. In addition, we chose the Dice coefficient as the loss function, which measures the pixel-wise agreement between a predicted segmentation and the ground truth.

2 Related Work

Early medical image segmentation methods were mostly based on statistical shape, gray-level, and texture features, as surveyed in [12]. More recently, level sets [18] and graph cuts [19] have been employed for biomedical image segmentation. However, these methods are not widely used owing to their slow speed and complex operation.

Through rapid development in recent years, deep convolutional neural networks (CNNs) have been exploited to achieve state-of-the-art results [10, 11, 28] in computer vision. More encouragingly, CNNs appear to be broadly applicable, which prompted us to employ them to automatically segment nerves from ultrasound images.

Semantic segmentation methods based on convolutional neural networks have advanced considerably. FCN [23] was the first model to use fully convolutional networks, which produce output of the same size as the input, together with a skip architecture that combines semantic information from deep layers with appearance information from shallow layers. SegNet [2] uses an encoder-decoder structure to restore the feature maps from higher layers with spatial information from lower layers. DeepLab [3] combines the responses at the final CNN layer with a fully connected Conditional Random Field (CRF) to improve localization accuracy. PSPNet [28] proposes a pyramid scene parsing network to combine local and global features.

As for medical image segmentation, U-Net [25] can be trained end to end from very few medical images, and [24] proposes V-Net for 3D image segmentation. [5] uses cascaded fully convolutional neural networks for liver and tumor segmentation: it first trains an FCN to segment the liver, whose output serves as input to a second FCN that segments only the tumor.

3 Dataset

The dataset we used to evaluate our model is provided by Kaggle; it contains ultrasound images of a collection of nerves called the brachial plexus. The training data is a set of 5635 images (\(580\times 420\) pixels) and their corresponding masks, in which nerve (white) and background (black) have been manually annotated (Fig. 1). The test set of 5508 ultrasound images is publicly available, but its segmentation maps are kept secret. The main purpose of accurately segmenting nerve structures in ultrasound images is to help doctors effectively insert a patient's pain management catheter, which mitigates pain at the source in order to decrease narcotic drug intake and speed up patient recovery.

Fig. 1. The left image is the raw ultrasound image containing the nerve structure, the middle image is the corresponding mask manually annotated by experts, and the right image is the ultrasound image overlaid with the ground-truth segmentation map (red border). (Color figure online)

4 Improved U-Net Model

The network architecture is illustrated in Fig. 2. There are two paths in this architecture, similar to U-Net [25]: a contracting path and an expansive path. In this network, we combine the two paths with inception modules [20, 27] and batch normalization [14]. The contracting path (Fig. 2, left) is a normal convolutional neural network for recognition; it comprises 3 basic convolutional units (Fig. 4) and 4 inception modules (Fig. 3), along with 5 max pooling operations with stride 2 for downsampling. Between the contracting path and the expansive path there is another inception module. The expansive path (Fig. 2, right) contains 5 upsampling layers, each followed by basic convolutional units or inception modules, generally symmetric to the contracting path. At the final layer, a 1\(\,\times \,\)1 convolution and a sigmoid activation function are used to output 2-class segmentation maps of the same size as the inputs. We also use skip architectures [23], which concatenate features of shallower layers from the contracting path with features of deeper layers from the expansive path; the concatenated features are then processed by the 4 convolutional paths of the inception modules. A sketch of this encoder-decoder pattern is given below.
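As a rough illustration, the following minimal sketch (our own, not the authors' released code) shows the contracting/expansive pattern with skip connections in Keras. The filter counts and the simplified conv_block stand-in are our assumptions; the paper's actual basic units and inception modules are detailed later in this section.

from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Stand-in for the paper's basic convolutional unit (Conv -> BN -> ReLU).
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

inputs = layers.Input(shape=(64, 64, 1))           # grayscale ultrasound input

# Contracting path: save each stage's features for the skip connections.
skips, x = [], inputs
for filters in (32, 64, 128, 256, 512):            # 5 downsampling stages
    x = conv_block(x, filters)
    skips.append(x)
    x = layers.MaxPooling2D(2)(x)                  # stride-2 max pooling

x = conv_block(x, 1024)                            # module between the two paths

# Expansive path: upsample, concatenate the matching skip features, convolve.
for filters, skip in zip((512, 256, 128, 64, 32), reversed(skips)):
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)
    x = layers.concatenate([x, skip])              # skip architecture [23]
    x = conv_block(x, filters)

# Final 1x1 convolution + sigmoid gives per-pixel foreground probabilities.
outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
model = Model(inputs, outputs)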

Fig. 2. The proposed architecture. The blue and green boxes represent the basic convolutional units and the inception modules, respectively, while the red and yellow boxes signify downsampling (max pooling) and upsampling. There are also 5 skip architectures. (Color figure online)

Fig. 3. The inception module, in which four convolutional paths act on one input and are joined into one output, with average pooling as the pooling layer.

Each inception module has 4 paths, which act together on one input and are concatenated into one output, following the practice in [27]. Before every expensive 3\(\,\times \,\)3 and 5\(\,\times \,\)5 convolution, a 1\(\,\times \,\)1 convolutional layer is used to reduce computation. In total, the inception module has 6 convolutional layers and an average pooling layer. Except for the final layer, every convolutional operation is followed by a batch normalization and a rectified linear unit (ReLU), which together make up a basic convolutional unit (Fig. 4). We employed simple linear interpolation as the upsampling operation to bring the output segmentation maps up to the size of the input image, rather than using deconvolutional layers as in [23], which demand extra computation. A sketch of the module follows.
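In code, such a module could look like the following minimal sketch; the wiring follows the description above (four paths, six convolutions, average pooling), while the per-path filter counts are illustrative assumptions of ours.

from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size):
    # Basic convolutional unit (Fig. 4): convolution -> batch norm -> ReLU.
    # "glorot_normal" is the Xavier normal initializer [8] used in Sect. 5.
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False,
                      kernel_initializer="glorot_normal")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def inception_module(x, filters):
    p1 = conv_bn_relu(x, filters, 1)                             # path 1: 1x1
    p2 = conv_bn_relu(conv_bn_relu(x, filters, 1), filters, 3)   # path 2: 1x1 -> 3x3
    p3 = conv_bn_relu(conv_bn_relu(x, filters, 1), filters, 5)   # path 3: 1x1 -> 5x5
    p4 = layers.AveragePooling2D(3, strides=1, padding="same")(x)
    p4 = conv_bn_relu(p4, filters, 1)                            # path 4: avg pool -> 1x1
    return layers.concatenate([p1, p2, p3, p4])                  # concatenated output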

Fig. 4. The basic convolutional unit we used, in which the convolutional layer is followed by a batch normalization and a ReLU.

5 Training

The input images and their corresponding masks are resized to \(64\times 64\) to reduce the computational cost. We adopt batch normalization [14] after each convolution and before activation. We use Adam [15] with a mini-batch size of 30. The learning rate is 0.001 and the models are trained for up to 78 epochs with Keras [4]. We use \(\beta _1\) of 0.9 and \(\beta _2\) of 0.999 following [15]. We set the initial random weights using the Xavier normal initializer [8]. We did not use dropout [13] or any other regularization. A minimal sketch of this configuration follows.
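Under the settings above, the training setup might be written as the following sketch, assuming the model from Sect. 4 and the dice_loss defined after Eq. (2) below; x_train and y_train are hypothetical names for the resized images and masks.

from tensorflow.keras.optimizers import Adam

# Adam with learning rate 0.001, beta_1 = 0.9, beta_2 = 0.999, as stated above.
model.compile(optimizer=Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
              loss=dice_loss)
# Mini-batch size of 30, up to 78 epochs; no dropout or other regularization.
model.fit(x_train, y_train, batch_size=30, epochs=78)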

In medical image segmentation, the region of interest is more important than the background. If we used binary cross entropy to predict the probability of each pixel belonging to the foreground or the background, all pixels in the input images would be equally important. However, the region of interest (white) accounts for a much smaller proportion of the area than the background (black), as shown in Fig. 1; as a result, the region of interest would often be missed. Binary cross entropy is defined as

$$\begin{aligned} L =-\frac{1}{n}\sum ^n_{i=1}[t_i \log (o_i)+(1-t_i) \log (1-o_i)] \end{aligned}$$
(1)

where the sum runs over the n pixels, i denotes the pixel position, \(t_i\) is the ground-truth value, and \(o_i\) is the predicted pixel value. In this paper, we employed the Dice coefficient as the loss function, following the practice in [24]:

$$\begin{aligned} L =-\frac{2\sum ^n_{i=1}o_it_i}{\sum ^n_{i=1}o^2_i+\sum ^n_{i=1}t^2_i} \end{aligned}$$
(2)

where n, i, \(o_i\), and \(t_i\) have the same meaning as in the binary cross entropy.
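As a minimal sketch (our own, assuming tf.keras), Eq. (2) can be implemented as a loss function; the smoothing constant is an implementation detail we add to avoid division by zero, not something specified above.

import tensorflow.keras.backend as K

def dice_loss(t, o, smooth=1e-6):
    # Keras calls a loss as loss(y_true, y_pred): t is the ground-truth mask,
    # o the predicted per-pixel probabilities, matching Eq. (2).
    t = K.batch_flatten(t)
    o = K.batch_flatten(o)
    intersection = K.sum(o * t, axis=-1)
    denominator = K.sum(K.square(o), axis=-1) + K.sum(K.square(t), axis=-1)
    # Negative Dice coefficient, so minimizing the loss maximizes overlap.
    return -K.mean((2.0 * intersection + smooth) / (denominator + smooth))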

6 Experiments and Analysis

The model is trained end to end on the 5635 training ultrasound images and tested on the 5508 test images. A score can be obtained by submitting the predicted segmentation maps to Kaggle's server. The result is evaluated on the mean Dice coefficient:

$$\begin{aligned} D =\frac{2|X \cap Y|}{|X|+|Y|} \end{aligned}$$
(3)

where X is the set of predicted pixel values and Y is the set of ground-truth values; correspondingly, |X| and |Y| are the numbers of elements in them. We trained and evaluated our method and U-Net [25] (Fig. 5), which inspired our idea and is widely used in biomedical image segmentation. Like our model, the U-Net we tested in this paper adopts the Dice coefficient as the loss function and replaces the original convolutional layers with basic convolutional units (Fig. 4), without dropout or any other regularization.
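For reference, a minimal NumPy sketch of Eq. (3) on binarized masks could look as follows; the 0.5 threshold for turning predicted probabilities into the pixel set X, and scoring a pair of empty masks as 1.0, are our assumptions.

import numpy as np

def dice_score(pred, truth, threshold=0.5):
    x = pred > threshold                       # predicted foreground pixels, X
    y = truth > 0.5                            # ground-truth foreground pixels, Y
    intersection = np.logical_and(x, y).sum()  # |X intersect Y|
    total = x.sum() + y.sum()                  # |X| + |Y|
    return 1.0 if total == 0 else 2.0 * intersection / total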

Fig. 5. The U-Net [25] used for comparison with our model, which has 23 convolutional layers in total. In our work, batch normalization is used after each convolution and before activation, unlike in the original model.

In Fig. 6 we compare the Dice coefficient of our model with that of U-Net [25] during the training procedure. We observed that our model reached 0.34 after the first epoch and 0.80 after the 32nd epoch, faster than U-Net, which reached only 0.05 after the first epoch and 0.80 after the 50th epoch. The much faster convergence indicates that the inception modules we adopted can accelerate the training procedure.

Fig. 6. Training on 5635 ultrasound images using U-Net (orange) and our model (blue). The lines denote the training Dice coefficient. (Color figure online)

The results in Table 1 show that our model achieves a score of 0.653, roughly equal to the score of 0.658 from U-Net. However, our model has far fewer parameters, only 16% as many as U-Net. The reason for the parameter reduction is that the \(1\times 1\) convolution acts only across channels, ignoring spatial correlations within a feature map, and reduces the channel dimension before the larger convolutions (a rough worked example follows). Figure 7 shows a test result using our model.
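To illustrate with channel counts of our own choosing (not taken from the paper): a direct 3\(\,\times \,\)3 convolution mapping 256 channels to 256 channels costs \(3\times 3\times 256\times 256 \approx 590\mathrm {K}\) weights, whereas first reducing to 64 channels with a 1\(\,\times \,\)1 convolution and then applying the 3\(\,\times \,\)3 convolution costs \(1\times 1\times 256\times 64 + 3\times 3\times 64\times 256 \approx 164\mathrm {K}\) weights, roughly 28% of the direct cost.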

Table 1. Results on the test set. The first column shows the models we tested, the second column their Dice coefficients, and the third the total number of parameters of each model.
Fig. 7. One result on the test data: the left image is the test ultrasound image containing a nerve structure, the middle image is the predicted segmentation map, and the right image is the ultrasound image overlaid with the segmentation map (green border). (Color figure online)

7 Conclusion

We presented an approach based on convolutional neural networks that achieves good performance on ultrasound image segmentation and possesses fewer parameters thanks to inception modules. We adopted an effective loss function, the Dice coefficient between the predicted segmentation maps and the ground truth. Our model has a satisfactory training time of 54 h on an Intel Core i7 vPro with 16 GB of memory. Future work will aim at different biomedical segmentation applications and at training our model on GPUs.