
1 Introduction and Background

The automatic segmentation of blood vessels from retinal fundus images has gained interest in the image processing community, due to its applicability to several problems in different fields. Specifically, the automatic analysis of retinal blood vessels is a basic step both for the diagnosis of several retinal pathologies [2] (e.g. diabetic retinopathy, arteriosclerosis, hypertension, and various cardiovascular diseases), and for person verification in biometric systems, since the retinal vascular structure is different for each individual [15]. Retinal image segmentation remains a challenging task, due to the complex nature of vascular structures, illumination variations and the anatomical variability between subjects, and several methods have been proposed in the literature. Existing methods can generally be divided into two categories: supervised and unsupervised. In general, the performance of supervised methods is superior to that of unsupervised ones, but at the cost of lower speed and higher computational complexity. Supervised methods are based on machine learning techniques and require a manually annotated set of training images in order to classify each pixel as either vessel or non-vessel; they involve a k-NN classifier, a Support Vector Machine, a Bayesian classifier combined with features obtained through the multi-scale analysis of Gabor wavelets, AdaBoost, or a CNN, etc. [6, 8, 11, 17, 19,20,21]. Unsupervised segmentation methods work without any prior knowledge and are based on matched filtering, centerline tracking, mathematical morphology, or other rule-based techniques [3,4,5, 7, 12, 22,23,24].

In this paper a supervised method based on a CNN and on the use of directional filters, as proposed in [7], is presented. The method has been tested on the DRIVE [19] dataset and its performance has been compared to that of other methods in the literature. The experimental results confirm the effectiveness of the proposed approach.

The rest of the paper is organized as follows: in Sect. 2, the proposed CNN architecture is presented; Sect. 3 describes the experiments, together with the training strategy, and shows the obtained results; and finally in Sect. 4 some conclusions are drawn.

2 The Method

Our solution is based on a CNN used as a pixel classifier and on the introduction of directional filters. As done by the majority of authors in the field, only the green channel of the RGB retinal image is considered, since the vessels show the highest contrast in this channel. In this work, vessel segmentation is addressed as a pixel-level binary classification task. The network computes the probability of a pixel being a vessel, using as input a patch of the image, i.e. a square window centered on the pixel itself. The input image is then segmented by classifying all of its pixels. The CNN is trained on a large number of patches, in which the central pixel is annotated using the corresponding ground truth included in the dataset. Details about the training dataset are given in the next section.
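The patch-based scheme can be sketched as follows. This is an illustrative reconstruction in Python; the function names and the boundary handling are ours, not taken from the original implementation:

```python
import numpy as np

def extract_patch(green, r, c, size=27):
    """Extract the square window centered on pixel (r, c).

    `green` is the green channel of the retinal image; `size` is the
    patch side (27 in the experiments of Sect. 3). Returns None when
    the window does not fit entirely inside the image.
    """
    half = size // 2
    if (r - half < 0 or c - half < 0
            or r + half >= green.shape[0] or c + half >= green.shape[1]):
        return None
    return green[r - half:r + half + 1, c - half:c + half + 1]

def label_patch(ground_truth, r, c):
    """Label the patch after its central pixel: 1 = vessel, 0 = non-vessel."""
    return 1 if ground_truth[r, c] > 0 else 0
```

At test time, the same window is extracted around every pixel and classified, producing the segmented image.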

A CNN is organized in stacked trainable stages, called layers, each composed of processing units operating on the output of the previous layer. A CNN consists of a number of convolutional and sub-sampling layers, optionally followed by fully connected layers. A convolutional layer has k filters (or kernels) that produce k feature maps. Each map is then typically sub-sampled by means of pooling layers, a process that progressively reduces the spatial size of the representation, the number of parameters and the computation in the network. The pooling can be of different types (max, average, sum, etc.), but max pooling, in which the largest element of the current feature map within a window is kept, is the most commonly applied [16]. The output of the convolutional and pooling layers consists of high-level features of the input image. The purpose of the fully connected layers is to use these features to classify the input image into various classes based on the training dataset.
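As an illustration, max pooling over a \(3\times 3\) window with stride 2 (the configuration adopted by the pooling layers of our network) can be written as a naive sketch:

```python
import numpy as np

def max_pool(feature_map, window=3, stride=2):
    """Naive max pooling: slide a window x window box with the given
    stride and keep the largest value at each position."""
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * stride:i * stride + window,
                                    j * stride:j * stride + window].max()
    return out
```

Note how a \(5\times 5\) map is reduced to \(2\times 2\): this is the progressive spatial reduction mentioned above.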

The layers of the proposed CNN architecture are:

  • A first non-learnable convolutional layer: to improve the performance of our method, we have introduced a new layer of directional filters that guides the training of the network toward the linear behavior which characterizes blood vessels. In fact, vessels are thin and elongated structures whose pixels are aligned along different directions. Thus, by fixing a number of different orientations and by taking into account the gray-levels in a suitable window centered on a pixel p, directional information can be computed for p. This directional information is then combined by the higher layers of the network to obtain more complex features. Differently from the filters of the other convolutional layers, the directional filters are not learned during the training process. They consist of twelve windows, each of size \(7\times 7\), like the ones presented in [7] (see Fig. 1), with each window representing a direction such that an angle of \(15^\circ \) separates two successive directions.

  • Five convolutional layers: all the filters of these layers have a size of \(3\times 3\) and a stride of 1. The first layer of this block learns 32 filters, the second and third learn 64 filters, and the fourth and fifth learn 128 filters. All the filters of these layers are initialized with Xavier initialization [9] and are followed by a rectified linear unit (ReLU) non-linearity [13], i.e. the output volume is max(0, e), where e is the outcome of the convolution.

  • Five max-pooling layers: max-pooling is performed after each convolutional layer. It is computed on a window of size \(3\times 3\) and the stride is set to 2.

  • Three fully connected layers: the first two fully connected layers learn 256 filters each, used to learn non-linear combinations of the features provided by the previous layers. Moreover, to mitigate overfitting, these two layers implement dropout regularization [18] with a ratio of 0.5. Finally, the last layer has 2 filters, since in our case the classification problem is binary. All the filters of these layers are initialized with values sampled from the N(0, 0.01) Gaussian distribution.
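A possible reconstruction of the directional filter bank of the first layer is sketched below: each \(7\times 7\) mask marks the pixels lying closest to a line through the window center, at one of twelve orientations spaced \(15^\circ \) apart. The exact coefficients of the filters in [7] may differ; this code only illustrates the idea.

```python
import numpy as np

def directional_filters(size=7, n_dirs=12):
    """Build a bank of line-shaped masks, one per orientation.

    Illustrative reconstruction: the mask for direction k approximates a
    line through the window center at angle k * 15 degrees, obtained by
    rasterizing points sampled densely along the line.
    """
    half = size // 2
    bank = []
    for k in range(n_dirs):
        theta = np.deg2rad(k * 180.0 / n_dirs)  # 15-degree steps
        mask = np.zeros((size, size))
        for t in np.linspace(-half, half, 8 * size):
            r = int(round(-t * np.sin(theta)))
            c = int(round(t * np.cos(theta)))
            mask[half + r, half + c] = 1.0
        bank.append(mask)
    return np.stack(bank)
```

Convolving the input patch with such a bank produces one response per orientation, which the learnable layers above can then combine.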

Table 1 summarizes the parameters of the CNN layers, where n-C stands for the Convolutional layer with non-learnable filters, C + P stands for the Convolutional layer followed by Max-pooling layer and FC stands for the Fully Connected layer.

Fig. 1. The 12 directional filters implemented in the first layer of the proposed CNN.

Table 1. Summary of the proposed CNN architecture

3 Experiments

3.1 Training Strategy and Parameters Setting

The training phase consists of an iterative presentation of the patches together with their associated labels. Patches are randomly extracted from the set of training images of the DRIVE dataset. In particular, DRIVE contains 20 training images and 20 testing images, and each image is associated with both a mask delimiting the Field of View (FOV) of the retinal image and two ground truths generated by two ophthalmologists. However, only patches completely contained in the FOV of the retinal images, and only the ground truths of the first expert, are taken into account. Each patch of an image I has a size of \(27\times 27\) and is labeled as vessel or non-vessel depending on whether its central pixel belongs to the foreground or background of the ground truth associated with I. The experiments have been performed considering two types of training sets: a non-balanced set, in which most of the patches are labeled as non-vessel, and a balanced set, including a balanced percentage of patches of the two classes. Precisely, the non-balanced training set is composed of 480,000 random patches, while the balanced set includes about 700,000 patches.
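A balanced set of patch centers of the kind described above can be drawn as in the following sketch. This is illustrative code: `sample_patch_centres` is our name, and the FOV-containment test is simplified to a check on the mask at the central pixel.

```python
import numpy as np

def sample_patch_centres(ground_truth, fov, n_per_class, half=13, rng=None):
    """Draw a class-balanced set of patch centers.

    Picks n_per_class vessel and n_per_class non-vessel pixels whose
    27x27 window (half = 13) lies entirely inside the image; `fov` is a
    boolean mask of the Field of View. Returns (row, col, label) tuples.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = ground_truth.shape
    inside = np.zeros_like(fov, dtype=bool)
    inside[half:h - half, half:w - half] = True
    valid = fov & inside
    centres = []
    for is_vessel in (True, False):
        rows, cols = np.nonzero(valid & (ground_truth.astype(bool) == is_vessel))
        idx = rng.choice(len(rows), size=n_per_class, replace=False)
        centres += [(r, c, int(is_vessel)) for r, c in zip(rows[idx], cols[idx])]
    return centres
```

The non-balanced set, by contrast, would simply sample centers uniformly over the valid region, so most patches end up labeled non-vessel.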

For the testing phase, patches are extracted from the DRIVE test images. In particular, for each pixel p of a test image I, the patch centered on p is obtained and it is involved in the testing phase only if it is completely contained in the FOV of I.

The number of training epochs is set to 15, while the batch size is equal to 256. The learning rate is initially set to \(10^{-2}\) and is decreased every six epochs by a factor of 10. To train our network, we used the NVIDIA Deep Learning GPU Training System [1] within the Caffe framework [10].
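The resulting step schedule can be written compactly as follows (a sketch; counting epochs from 0 is our assumption):

```python
def learning_rate(epoch, base_lr=1e-2, step=6, gamma=0.1):
    """Step schedule: start at 1e-2 and divide by 10 every six epochs."""
    return base_lr * gamma ** (epoch // step)
```

Over the 15 training epochs this yields three plateaus: \(10^{-2}\), \(10^{-3}\) and \(10^{-4}\).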

3.2 Results

A qualitative evaluation of the method is possible with reference to Fig. 2, where each row shows, from left to right, the input image, the ground truth of the first expert, and the result of our segmentation method.

Fig. 2. Results of our method: (a) original image, (b) ground truth, (c) results of the proposed method.

We have quantitatively evaluated our method for both training sets, balanced and non-balanced, in both cases with and without the max-pooling layers, and also with and without the directional filters. We have computed Accuracy, Sensitivity and Specificity, as done by the majority of researchers [14].
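These three measures derive from the pixel-level confusion matrix restricted to the FOV; a minimal sketch:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Accuracy, sensitivity and specificity from binary vessel maps.

    `pred` and `truth` are boolean arrays where True marks vessel
    pixels; only pixels inside the FOV should be passed in.
    """
    tp = np.sum(pred & truth)    # vessel pixels correctly detected
    tn = np.sum(~pred & ~truth)  # background correctly rejected
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

Sensitivity measures the fraction of vessel pixels recovered, while specificity measures the fraction of background pixels correctly left out; accuracy mixes both and is dominated by the background class, which is why all three are reported.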

Table 2. Results of the proposed method applying different training strategies

Quantitative results obtained by applying different training strategies are given in Table 2. For the non-balanced training set, the best result is obtained without the max-pooling layers with respect to all the considered measures. For the balanced training set, we obtained a better performance in terms of accuracy and specificity without max-pooling layers, while a lower performance was obtained only for sensitivity. Overall, the non-balanced training set combined with the network without max-pooling layers provided the best performance in terms of accuracy and specificity, and also gave a high sensitivity value.

To demonstrate how the introduction of directional filters in the network improves the performance of our method, we computed the considered measures both with and without directional filters for the NoBal/NoPool strategy (see Table 3). We observed that the use of directional filters produces a higher sensitivity value, while equivalent values are obtained for the remaining measures.

Table 3. Results of the proposed method with or without the directional filters (DF)
Table 4. Performance Comparisons

Finally, we also compared the performance of our method with that of other unsupervised and supervised methods in the literature. The average values of accuracy, sensitivity and specificity are reported in Table 4, where the highest values are in bold. Our method achieves a better sensitivity than all the other methods, a better accuracy than the supervised methods, and a lower performance only in terms of specificity with respect to some methods.

4 Conclusion

In this work, we have presented a supervised vessel segmentation method based on a Convolutional Neural Network. The adopted CNN architecture includes a specific layer to compute directional features; the introduction of this layer improves the performance of the network in terms of sensitivity. The method provides results that are satisfactory both qualitatively and quantitatively. The performance of the method has been evaluated on the DRIVE dataset in terms of accuracy, sensitivity and specificity. Comparisons have also been made with other unsupervised and supervised methods in the literature, showing that the proposed method achieves the highest performance in terms of sensitivity.