Keywords

1 Introduction

Early diagnosis is a key to reducing mortality rates in cardiovascular diseases, which caused 30.8% of deaths and were the top healthcare expenditure in the USA in 2013 [9]. Retinal fundus imaging allows the non-invasive observation of the retinal vascular system. This modality thereby offers a good overview of cardiovascular health: statistically, a patient suffering from retinopathy is twice as likely to have a stroke [13]. Clinicians already use fundus images to evaluate cardiovascular risks, by analysing the retinal vasculature morphology through measures like the arterio-venous ratio (highly correlated with hypertension and diabetes risks). However, the accuracy of these measures relies on the robustness of the vessels segmentation and classification between arteries and veins.

In the last two decades, many algorithms were developed to handle those tasks. Most of them perform the vessel segmentation separately from the classification. For the former task, traditional computer vision algorithms such as the multi-scale line detector [10] have been out-performed by deep learning algorithms, using either convolution neural networks (CNNs) [8], CNNs combined with Conditional Random Fields (CRFs), e.g. the Deep Vessel architecture [3], or by adversarial architectures [6]. Meanwhile, for vessel classification, state of the art methods generally combine local features analysis by machine learning algorithms with prior knowledge of the vascular tree structure. Dashtbozorg et al. use linear discriminant analysis outputs in combination with the vascular graph corrected by rules derived from prior knowledge [1]. More recently, Estrada et al. proposed a graph-based algorithm to extract the vascular tree from a fundus image and classify each detected vessel using local features [2]. However, relying on rules derived from prior knowledge can impact the robustness of the algorithm, in particular for severe cases of retinopathy where the vasculature won’t match the rules.

Recent progress in deep learning has made possible the training of larger and deeper Fully Convolutional Neural Networks (FCNNs). In particular, the U-Net achieves remarkably good performance in segmenting medical images thanks to its encoder/decoder architecture and to its skip-connexions [12]. However, to our knowledge, those architectures have never reached state of the art performance in artery/vein classification. Indeed, because of their narrow receptive fields, FCNNs struggle to learn the wide and extensive topology of retinal vessels.

This paper propose a novel Large Receptive Field Fully Convolutional Network architecture (LRFFCN), capable of segmenting very extensive shapes (i.e. vessels) in high-resolution images at a minimal memory and computational cost. This paper is organized as follows: (1) analysis of the U-Net’s poor performances in classifying arteries and veins; (2) description of a novel FCNN architecture: the Large Receptive Field Fully Convolutional Network; (3) evaluation of the LRFFCN architecture experimentally in the semantic vessel segmentation task.

2 Methods

2.1 Large Receptive Field Fully Convolutional Network

Receptive Field Limitation in Convolutional Networks. The term receptive field is inherited from neurosciences and describes the region of the sensory space (e.g. the visual field) in which a stimulus will cause a neuron to be activated. In deep learning, the receptive field of a convolutional network is the region of the input analyzed by the network to produce the prediction for one pixel. Early CNN designers performing pixel-wise classification didn’t care about this concept: because their models ended with fully connected layers, the receptive field was the whole input patch. However, this is not the case with FCNNs. For example, the encoding branch of the U-Net architecture has a receptive field of \(125\times 125\) pixels even though its training patch size is \(500\times 500\) pixels.

There are several ways to increase the receptive field of an FCNN. Stacking more layers or extending their kernel size will theoretically increase the receptive field linearly, whereas sub-sampling the output features of a layer will increase it multiplicatively. In practice, Luo et al. have shown that the Effective Receptive Field (ERF) is always narrower than the theoretical one [7]. More precisely, the ERF follows a Gaussian distribution with a standard deviation depending on the model architecture and the weights initialization. Extending the kernel size will increase the ERF linearly; stacking n layers will only increase the ERF by a factor \(\sqrt{n}\); sub-sampling will effectively increase the ERF quickly. On the contrary, skip-connections will shrink the ERF.

Focusing on the U-Net architecture, its decoding branch shouldn’t have much impact on the ERF: the growth due to the convolutional layers is compensated by the upsampling and the skip-connections. Intuitively, this branch is only a complex upsampling interpolation of the deep features. In other words, the ERF of a U-Net is strictly lower than the theoretical receptive field of its encoding branch: \(125\times 125\) pixels. When processing high-resolution fundus images (\(2048\times 2048\) pixels), such a receptive field is much too small to learn the topology of the vasculature, thus the network can only rely on local texture and color features. However, for small vessels far away from the optic disk, those features are not sufficient, even for clinicians, to efficiently discriminate arteries from veins.

As the U-Net architecture is already a large model, adding more layers to it would make the model too heavy. Namely adding a new pooling/up-conv stage would raise the forward-pass computation from 67 to 389 Mega Flops (mainly because the patch size is doubled).

LRFFCN Architecture. The Large Receptive Field Fully Convolutional Network is a novel network architecture which significantly increases the receptive field at a minimal memory and computational cost. The core of the architecture is a convolutional branch processing the full image at a low resolution (scaled down to a \(128\times 128\) pixels patch). This branch is structured as a repetition of fire/squeeze modules inspired by the SqueezeNet (a concise model with similar performance to AlexNet [4]). The theoretical receptive field of this branch is \(21\times 21\) pixels at \(1/16^{th}\) of the full resolution, which corresponds to \(336\times 336\) pixels in the high resolution image. The large scale features learned by this branch are then incorporated in an encoder/decoder FCNN with skip-connections.

The encoding stage of this network contains 4 pooling layers (\(2\times 2\) pooling), so the resolution of the deepest features \(F_e\) is \(1/16^{th}\). Because the FCNN is trained on patches and not on the full image, a region corresponding to the patch is extracted from the large scale features and is concatenated to \(F_e\). Thus, the decoding stage of the FCNN has access to textural information from high resolution features learned by the encoding stage, and to topological information from the large scale features at the cost of a forward-pass computation increase of only 6% (4 MFlops).

2.2 Model Specificities

The exact model trained for this paper is presented in Fig. 1. Its most interesting specificities are presented in this section.

Fig. 1.
figure 1

The large receptive field fully convolution network architecture. Color code: light blue: encoding branch; orange: decoding branch; dark blue: low resolution branch. (Color figure online)

Due to memory limitations, the maximum minibatch size during training is 8. Batch normalization layers would therefore be ineffective and have been replaced by a SeLu activation function [5]. This modified version of the ReLu activation function offers negative values which speed up learning by pushing mean activation towards 0, and a stable fixed point of the activation mean and variance so that exploding and vanishing gradients are impossible.

The model is trained to perform semantic segmentation: the vessels segmentation is performed simultaneously with the classification as each pixel is assigned a probability of being part of the background, an artery or a vein. Even if the LRFFCN architecture efficiently increases the ERF, performing this classification pixel-wise can cause local errors in the artery/vein classification. Indeed, because arteries and veins are visually similar and because of the local inconsistencies transmitted from the early layers to the later ones through the skip-connections, some pixels from a vessel can suddenly be predicted as veins even though they are surrounded by arteries (and vice versa). To correct such local inconsistencies, a CRF is added at the end of the model.

CRFs use the preprocessed version of the training patch as a reference and 2D Gaussian kernels to propagate the probabilities of being an artery or a vein along the vessel. We used the CRF as Recurrent Neural Network (RNN) implementation proposed by Zheng et al. [14] so its kernels are trained simultaneously with the LRFFCN parameters.

2.3 Training

The model was trained on 69 images: 30 images from the MESSIDOR training dataset, 19 images from the STARE dataset and 20 images from the DRIVE training dataset. We manually labelled the MESSIDOR and STARE images and asked an ophthalmologist to validate this labelling. Each image was preprocessed using a standard contrast enhancement technique. Both the enhanced and raw images were presented to the network since vessel detection strongly relies on the enhanced image whereas vessel classification relies on the vessels’ true colors. Color and geometric data augmentation were used to double the training dataset size (contrast and gamma variations for color augmentation, horizontal mirroring and rotation for geometric augmentation).

The training process was split into 3 phases. The large scale branch was pre-trained first over 100 epochs on full images rescaled to \(128\times 128\) pixels. Then, the full LRFFCN was trained on \(230\times 230\) pixels patches; the large-scale features already learned by the low resolution branch allow a quick convergence of the network. This training phase was driven by the Adadelta gradient descent optimizer and lasted 30 epochs during which the learning rate was slowly decreased. Finally, the CRF as RNN was added to the classifier layer and the network was trained again for 10 epochs.

3 Experiments

3.1 Evaluating the LRFFCN Architecture

We used the 20 images from the DRIVE test dataset to evaluate the LRFFCN architecture. The architecture was tested with and without the CRF as RNN layer. Also, to quantify the contribution of the low resolution branch to the prediction, we evaluated the performance of the model when the scaled-down image input to this branch was replaced by a uniform noise. (We refer to this network as LRFFCN w/o low branch.) Finally, a classic U-Net was trained for the same number of epochs to estimate the performance gain. For each of these models, we measured the accuracy of the vessel segmentation and the accuracy, specificity and sensitivity of the artery/vein classification. The results are presented in Table 1.

Table 1. Performance of LRFFCN architecture on DRIVE test dataset.

The segmentation accuracy is consistent across architectures and is quite high, confirming the efficiency of convolutional networks in segmenting vessels. However, the LRFFCN architecture improves the classification accuracy by almost \(5\%\) compared to the basic U-Net. The large-scale features learned by the low resolution branch seems to be effectively used by the model to improve its generalizing capabilities. This is confirmed by the noise experiment: replacing the real data by noise make the classification accuracy drop by \(16\%\). Thus, the large-scale feature branch contributes greatly to the network’s predictions.

Fig. 2.
figure 2

Comparison of LRFFCN performance with and without CRF as RNN, on two images from DRIVE test dataset. Top row: 06_test; bottom row: 17_test. Color codes: red: true artery; dark blue: true vein; light blue: misclassified vein; yellow: misclassified artery. (Color figure online)

The CRF layer improves the classification accuracy by \(2\%\) and successfully propagates the vessel classes to correct local errors. Those corrections are visible in Fig. 2(b), where artery segments classified as veins by the LRFFCN are corrected by the CRF layer. However, the LRFFCN’s predictions are sometimes not sufficiently accurate and the CRF layer propagates misclassifications (as visible for some veins in Fig. 2(a)). Overall, the CRF improves the topological plausibility of the predicted vessel network.

3.2 Model Segmentation and Classification Performance

In this section, we compare the performance of the LRFFCN to those of state of the art algorithms. For the segmentation task, the LRFFCN exceeds by \(1.3\%\) the Deep Vessel network in terms of accuracy. More precisely, our architecture is much more sensitive (\(80.8\%\) against \(72.7\%\)) (Tables 2 and 3).

Table 2. Segmentation results on DRIVE test dataset.
Table 3. Classification results on DRIVE test dataset.

For the artery/vein classification, the performance of the LRFFCN architecture is still \(10\%\) below state of the art graph-based algorithms. Indeed, for the DRIVE dataset, the prior topological knowledge of retinal vessels provides a good enough estimation of the vascular tree. Graph-based algorithms can propagate the artery/vein probability through this tree, whereas our method didn’t perfectly learn the vasculature topology and often misclassify the small vessels farther from the optic disk.

4 Discussion and Conclusion

The proposed LRFFCN architecture efficiently increases the receptive field of the FCNN by means of a low resolution branch and successfully takes advantage of large-scale features to better learn the retinal vessel topology. In particular, the LRFFCN architecture outperforms the U-Net in classifying retinal arteries and veins. It also does better than state of the art algorithms in vessel segmentation, yielding a higher sensitivity. For the classification task, the LRFFCN architecture does not reach state of the art performance. However, we believe that this particular result is due to the training conditions and not to the model design.

Nevertheless, these results are promising. Indeed, in contrast to static graph-based analysis, the performance of this model will improve as the training dataset grows. In particular, the bottleneck for this architecture is the training of the low resolution branch. Because this branch must be trained on whole images and not on patches, our current training dataset is too small for the LRFFCN architecture to show its full potential. But because the segmentation accuracy is high, the predictions from our method can quickly be fixed by a clinical expert. In other words, the LRFFCN architecture can be used to efficiently generate more ground truth images on which it can then be trained to improve its vessel classification performance.