
1 Introduction

Glaucoma will affect approximately 80 million people worldwide by 2020 [12]. Because the disease is asymptomatic in its early stages, glaucoma patients are usually unaware of it until noticeable vision loss occurs at a later stage. Early detection and treatment are essential to reduce the progression of vision loss. Glaucoma diagnosis is based on medical history, intra-ocular pressure and visual field loss tests, together with an assessment of the optic disc (OD) through ophthalmoscopy. In 2D color retinal fundus images, the OD can be divided into two distinct regions: the central bright optic cup (OC) and the peripheral neuroretinal rim. The loss of optic nerve fibers leads to enlargement of the cup region, a process called cupping. An important indicator of glaucoma is therefore the enlargement of the cup relative to the OD, measured as the vertical cup-to-disc ratio (CDR). Quantification of the CDR requires accurate delineation of the boundaries of the optic disc and cup.

There are many automated methods for optic disc segmentation [10, 14] in the literature. However, only a few have tackled optic cup segmentation, since its ill-defined and inhomogeneous boundaries make the task very challenging. Existing approaches to optic cup segmentation are based on level sets [6], superpixel classification [15] and sparse dictionary learning [1]. In another method [2], fusion of cup segmentations from multi-view fundus images was used to improve performance.

Fully supervised approaches require large numbers of annotated images to achieve reasonable robustness and accuracy, which are often difficult to obtain because annotation is time-consuming and costly. Semi-supervised approaches tackle this problem by leveraging a large amount of unlabeled data alongside the labeled data to improve performance. For example, semi-supervised approaches have been applied to different medical imaging tasks such as brain MRI segmentation [11], lung nodule detection [7] and retinal vessel segmentation [16].

In this paper, we propose a novel approach that leverages unlabeled images to segment the optic cup in retinal fundus images. The proposed method is based on learning a generative model from the unlabeled data and utilizing the feature embedding provided by the trained generative model. We use the variational autoencoder (VAE) [5] as the generative model, which learns the feature embedding as a latent variable without assuming a specific distance measure. Although VAEs have been extended to semi-supervised classification [3, 4, 9], they have not been applied to the segmentation task. Our approach first learns the feature embedding from a large number of unlabeled images using a VAE. We then train a segmentation autoencoder that maps the image to the segmentation mask, transferring the properties of the learned feature embedding through end-to-end training.

2 Proposed Semi-supervised Segmentation Method

In the proposed semi-supervised segmentation method, a generative model is first trained on a large number of unlabeled images. The feature embedding from the generative model is then incorporated into the segmentation model so that the segmentation model can be trained from a limited number of labeled images. We use the variational autoencoder (VAE) [5] as the generative model, which represents each observation in terms of a low-dimensional latent variable. The VAE has two parts: an encoder network, which maps the input image to the continuous latent variable \(\mathbf z \), and a decoder network, which uses the latent variable \(\mathbf z \) to reconstruct the image.

Fig. 1. Proposed segmentation framework: the segmentation model (SVAE) contains the segmentation encoder (SE) and the segmentation decoder (SD). The generative model (GVAE) is used in the training of SVAE and is not required in the testing phase.

As shown in Fig. 1, our proposed semi-supervised learning approach for optic cup segmentation consists of two main components: the generative VAE (GVAE) and the segmentation VAE (SVAE), each of which is described below.

2.1 GVAE: Generative Variational Autoencoder

GVAE models the probability distribution of the image using neural networks and is composed of two parts: the generative encoder (GE) and the generative decoder (GD). GE takes an image \(\mathbf {x}\) as input and outputs the mean \(\mathbf {\mu }_z\) and standard deviation \(\mathbf {\sigma }_z\). The latent representation \(\mathbf {z}\) of the generative model is obtained by sampling from the distribution \(q_\mathbf {\phi }(\mathbf {z}|\mathbf {x})=N\left( \mathbf {\mu }_{z},\mathbf {\sigma }_{z}^{2}\mathbf {I}\right) \), where \(\phi \) denotes the parameters of the GE network. GE is modeled as a convolutional neural network with five convolution layers, each followed by a max-pooling layer that reduces the size of the feature response by half. Two dense layers are then attached to the feature response of the last layer to output \(\mathbf {\mu }_z\) and \(\mathbf {\sigma }_z\). The GD network consists of five in-network deconvolution layers [8] that take the latent representation \(\mathbf z \) and reconstruct the image \(\mathbf x \). GVAE is trained using the following loss function given by the variational lower bound [5]:

$$\begin{aligned} \mathcal {L}\left( \theta ,\phi ;\mathbf {x}\right) =-D_{KL}(q_{\phi }(\mathbf {z}|\mathbf {x})\parallel p(\mathbf {z}))+E_{q_{\phi }(\mathbf {z}|\mathbf {x})}[\log p_{\theta }(\mathbf {x}|\mathbf {z})] \end{aligned}$$
(1)

where the first term is the negative KL-divergence from the posterior approximation \(q_\phi (\mathbf z |\mathbf x )\) to the prior \(p(\mathbf z )\), the second term is the expected reconstruction error obtained from the GD network, and \(\theta \) denotes the parameters of the GD network. The reconstruction error can be computed using binary cross-entropy. The prior \(p(\mathbf z )\) is a spherical Gaussian distribution \(p(\mathbf {z}) = N(\mathbf {0},\mathbf {I})\), and the posterior distribution \(q_\phi (\mathbf {z} |\mathbf {x} )\), which is the output of the encoder network, is also Gaussian. The KL term can then be written in analytical form as:

$$\begin{aligned} -D_{KL}(q_{\phi }(\mathbf {z}|\mathbf {x})\parallel p(\mathbf {z}))=\frac{1}{2}\sum _{j=1}^{J}\left( 1+\log \left( \left( \sigma _{z}^{j}\right) ^{2}\right) -\left( \sigma _{z}^{j}\right) ^{2}-\left( \mu _{z}^{j}\right) ^{2}\right) \end{aligned}$$
(2)

The input to the decoder is a random sample generated from the posterior \(q_\phi (\mathbf {z} |\mathbf {x} )\), yet back-propagation is not possible through random sampling. To overcome this obstacle, the re-parametrization trick is used [5].
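To make the training objective concrete, the following is a minimal sketch of the re-parametrization trick and of the loss in Eqs. 1-2 (as a quantity to be minimized, i.e. the negative of the variational lower bound). It assumes a PyTorch implementation and images normalized to [0, 1] so that binary cross-entropy applies; the function names are illustrative, not part of the method description.

```python
# Sketch of the reparametrization trick and the GVAE loss (Eqs. 1-2).
import torch
import torch.nn.functional as F

def reparameterize(mu_z, log_var_z):
    # Sample z ~ N(mu_z, sigma_z^2 I) as z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through mu_z and log_var_z.
    std = torch.exp(0.5 * log_var_z)
    eps = torch.randn_like(std)
    return mu_z + eps * std

def gvae_loss(x_recon, x, mu_z, log_var_z):
    # Expected reconstruction error: binary cross-entropy summed over pixels.
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL divergence between N(mu_z, sigma_z^2 I) and N(0, I) in analytical form (Eq. 2).
    kl = -0.5 * torch.sum(1 + log_var_z - mu_z.pow(2) - log_var_z.exp())
    # Minimizing (recon + kl) maximizes the variational lower bound of Eq. 1.
    return recon + kl
```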

The GVAE model can be used to estimate the probability density of the data, from which unseen samples can be generated. The GE part of the GVAE model provides the latent representation of the input image, which we use as the feature embedding to improve segmentation performance.
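The exact filter counts of GE and GD are not specified above; the following is one possible instantiation of the five convolution/max-pooling blocks and five deconvolution layers, assuming 3-channel \(128\times 128\) inputs, a latent dimension of \(J=100\), and illustrative channel widths.

```python
# A hypothetical GE/GD architecture sketch (channel widths are assumptions).
import torch.nn as nn

class GE(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each block halves the spatial size: 128 -> 64 -> 32 -> 16 -> 8 -> 4.
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)      # outputs mu_z
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)  # outputs log(sigma_z^2)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

class GD(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        chans = [256, 256, 128, 64, 32, 3]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each deconvolution doubles the spatial size: 4 -> 8 -> ... -> 128.
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()]
        layers[-1] = nn.Sigmoid()  # reconstructed pixel intensities in [0, 1]
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)
        return self.deconv(h)
```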

2.2 SVAE: Segmentation Variational Autoencoder

The goal of SVAE is to predict the segmentation mask of the optic cup from a given image by leveraging the feature embedding learned by GVAE. Similar to GVAE, SVAE consists of two parts: the segmentation encoder (SE) and the segmentation decoder (SD). SE is modeled using five blocks of convolution and max-pooling layers followed by two dense layers, which output \(\mathbf {\mu }_v\) and \(\mathbf {\sigma }_v\). The latent representation \(\mathbf {v}\) of the segmentation model is obtained by sampling from the distribution \(q_\mathbf {\alpha }(\mathbf {v}|\mathbf {x})=N\left( \mathbf {\mu }_{v},\mathbf {\sigma }_{v}^{2}\mathbf {I}\right) \), where \(\alpha \) denotes the parameters of the SE network. SD consists of five deconvolution layers that take the latent representation \(\mathbf v \) as input and output the segmentation mask \(\mathbf y \).

In order to leverage the information within the unlabeled data for segmentation, the SVAE model is trained to reconstruct not only the segmentation mask but also the latent representation learned by GVAE. Given an image \(\mathbf x \), the corresponding latent representation \(\mathbf z \) is generated from the conditional distribution \(p_\phi (\mathbf z |\mathbf x )\) given by GE. The SVAE network is then trained using the following loss function:

$$\begin{aligned} \mathcal {L}\left( \mathbf {\alpha },\mathbf {\beta };\mathbf {x},\mathbf {y}\right) =-D_{KL}(q_{\mathbf {\alpha }}(\mathbf {v}|\mathbf {x})\parallel p(\mathbf {v}))+E_{q_{\mathbf {\alpha }}(\mathbf {v}|\mathbf {x})}[\log p_{\mathbf {\beta }}(\mathbf {y}|\mathbf {v})]+\parallel \mathbf {z}-\mathbf {v}\parallel ^{2} \end{aligned}$$
(3)

where the first term is the negative KL-divergence from the posterior approximation \(q_{\alpha }(\mathbf {v}|\mathbf {x})\) to the prior \(p(\mathbf {v})=N(\mathbf {0},\mathbf {I})\), which can be computed in analytical form as

$$\begin{aligned} -D_{KL}(q_{\alpha }(\mathbf {v}|\mathbf {x})\parallel p(\mathbf {v}))=\frac{1}{2}\sum _{j=1}^{J}\left( 1+\log \left( \left( \sigma _{v}^{j}\right) ^{2}\right) -\left( \sigma _{v}^{j}\right) ^{2}-\left( \mu _{v}^{j}\right) ^{2}\right) . \end{aligned}$$
(4)

The second term of Eq. 3 denotes the expected reconstruction error of the segmentation mask, which can be computed using binary cross-entropy, where \(\beta \) denotes the parameters of the SD network; the third term is the Euclidean distance loss between the latent codes produced by GE and SE. The training process therefore transfers the latent representation learned by the generative model into the segmentation model. The entire procedure is summarized in Algorithm 1.
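A hypothetical SVAE training step is sketched below. It minimizes the sum of the segmentation binary cross-entropy, the KL term of Eq. 4 and the latent matching term \(\parallel \mathbf {z}-\mathbf {v}\parallel ^{2}\) (the minimization form of the objective in Eq. 3), and assumes the GE sketch above, SE/SD modules with the same interfaces (SD emitting a one-channel mask), and the reparameterize helper; all names are illustrative.

```python
# Sketch of one SVAE training step implementing the loss of Eq. 3 (minimization form).
import torch
import torch.nn.functional as F

def svae_train_step(x, y, ge, se, sd, optimizer):
    # y: binary ground-truth cup mask as a float tensor in {0, 1}.
    with torch.no_grad():
        # Latent code z sampled from the (frozen) generative encoder p_phi(z|x).
        mu_z, log_var_z = ge(x)
        z = reparameterize(mu_z, log_var_z)

    mu_v, log_var_v = se(x)
    v = reparameterize(mu_v, log_var_v)
    y_pred = sd(v)  # predicted segmentation mask in [0, 1]

    recon = F.binary_cross_entropy(y_pred, y, reduction='sum')       # mask reconstruction
    kl = -0.5 * torch.sum(1 + log_var_v - mu_v.pow(2) - log_var_v.exp())  # Eq. 4
    latent_match = torch.sum((z - v) ** 2)  # transfers the GVAE embedding into SVAE

    loss = recon + kl + latent_match
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```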

Algorithm 1.

3 Experimental Results

The dataset used in this research is provided by EyePACS and contains 12,000 high-resolution fundus images. We select 600 images to create the labeled set, for which the ground truth is obtained by manual delineation of the cup region in every image by a clinician. The remaining 11,400 images are used as the unlabeled set. Since our goal is to segment the optic cup, we crop the optic disc region of every image by first segmenting the optic disc using the approach of [10] and then rescaling the crop to \(128\times 128\) pixels. Images in both the labeled and unlabeled sets are therefore disc-cropped.
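A sketch of this preprocessing step is shown below, assuming the optic disc bounding box is already available from the disc segmentation of [10]; the bounding-box format is an assumption.

```python
# Hypothetical preprocessing: crop the optic disc region and rescale to 128x128.
import cv2

def crop_and_resize(image, disc_bbox, size=128):
    x, y, w, h = disc_bbox                 # bounding box of the segmented optic disc
    crop = image[y:y + h, x:x + w]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_AREA)
```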

We split the labeled set into a training set of 400 images and a test set of 200 images. We train the network parameters on the training set and then evaluate the final model on the test set. In order to evaluate the effect of the number of training samples, we further divide the training set into four subsets containing 50, 100, 200 and 400 samples. Data augmentation is an important step in training deep networks. We augment the training images and the corresponding label masks in each subset through mirror reflection and rotation at 6 different angles, yielding 12 times the original set, as sketched below.
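One possible implementation of this augmentation is given below; the specific rotation angles are an assumption, as they are not listed here.

```python
# Sketch of the 12x augmentation: (original + horizontal mirror) x 6 rotation angles.
import numpy as np
from scipy.ndimage import rotate

def augment(image, mask, angles=(0, 60, 120, 180, 240, 300)):
    pairs = []
    for img, msk in [(image, mask), (np.fliplr(image), np.fliplr(mask))]:
        for angle in angles:
            pairs.append((
                rotate(img, angle, reshape=False, order=1),  # bilinear for the image
                rotate(msk, angle, reshape=False, order=0),  # nearest-neighbor for the mask
            ))
    return pairs  # 2 x 6 = 12 augmented (image, mask) pairs
```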

We first train the generative model (GVAE) on the unlabeled set for 20,000 iterations. We use mini-batch gradient descent with the RMSprop algorithm with momentum and a batch size of 50. The learning rate is set to 0.001 and reduced by a factor of ten after 10,000 iterations. Figure 2 shows examples of images reconstructed using GVAE. It can be seen that the reconstructions preserve the optic disc and optic cup structure, even though the surrounding vasculature is not clear. This shows that the latent representation obtained from GVAE is able to capture the structural information of the cup region.
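A minimal sketch of this GVAE training loop under the stated settings follows; the momentum value and data-loading details are assumptions.

```python
# Hypothetical GVAE training loop: RMSprop with momentum, batch size 50,
# lr 0.001 reduced by a factor of ten after 10,000 of 20,000 iterations.
import torch

def train_gvae(ge, gd, unlabeled_loader, iterations=20000):
    params = list(ge.parameters()) + list(gd.parameters())
    optimizer = torch.optim.RMSprop(params, lr=1e-3, momentum=0.9)  # momentum is assumed
    data_iter = iter(unlabeled_loader)
    for step in range(iterations):
        if step == 10000:  # reduce the learning rate by a factor of ten
            for group in optimizer.param_groups:
                group['lr'] = 1e-4
        try:
            x = next(data_iter)            # assumes the loader yields image batches
        except StopIteration:
            data_iter = iter(unlabeled_loader)
            x = next(data_iter)
        mu_z, log_var_z = ge(x)
        z = reparameterize(mu_z, log_var_z)
        x_recon = gd(z)
        loss = gvae_loss(x_recon, x, mu_z, log_var_z)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```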

Fig. 2. Random images from the test set (top row) and the corresponding images reconstructed using GVAE (bottom row)

We then train the segmentation model (SVAE) on the labeled set by leveraging the trained GVAE, as described in Algorithm 1. The SVAE network is trained for 10,000 iterations using mini-batch gradient descent with the Adam optimizer and a learning rate of 0.0001. In the testing phase, the output of SE (i.e., the mean value of the latent code, \(\mu _v\)) is fed directly to SD to obtain the segmentation mask, as sampling is only required in the training phase. For both GVAE and SVAE we set the dimension of the latent variables \(\mathbf z \) and \(\mathbf v \) to \(J=100\). We compare our method with the Unet [13], a state-of-the-art method in biomedical image segmentation. The architecture of the Unet is kept similar to SE and SD, except that the Unet has skip connections between the encoder and decoder. The Unet is trained with a binary cross-entropy loss using the same parameters as SVAE. We also compare SVAE with a plain SVAE (SVAE-Plain) that does not use the feature embedding from GVAE, i.e., it omits the third term of the loss in Eq. 3.
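A minimal sketch of this test-time inference, where sampling is skipped and \(\mu _v\) is fed directly to SD, is given below; the binarization threshold is an assumption.

```python
# Sketch of test-time inference with SVAE: use the mean latent code, no sampling.
import torch

@torch.no_grad()
def predict_mask(x, se, sd, threshold=0.5):
    mu_v, _ = se(x)                     # mean latent code mu_v from SE
    prob = sd(mu_v)                     # soft segmentation mask in [0, 1]
    return (prob > threshold).float()   # binary optic cup mask (threshold assumed)
```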

Table 1 compares the average Dice coefficient (DC) between the ground truth and the predicted segmentation for the proposed SVAE, the Unet and SVAE-Plain. The proposed SVAE achieves an average DC of 0.80 when trained on the full training set, slightly improving over the Unet and SVAE-Plain. When fewer training samples are used, however, the proposed method improves substantially over both the Unet and SVAE-Plain. This demonstrates that the proposed approach improves segmentation performance when the number of labeled images is limited, and that it can leverage the information in unlabeled samples by first learning the encoding with GVAE and then training SVAE with encodings generated from GVAE. Figure 3 shows examples of segmentations produced by the proposed semi-supervised method.
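For reference, the Dice coefficient reported in Table 1 can be computed from binary masks as in the following simple NumPy sketch.

```python
# Dice coefficient between a predicted binary mask and the ground-truth mask.
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    # 2|A ∩ B| / (|A| + |B|); eps avoids division by zero for empty masks.
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```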

Table 1. Optic cup segmentation performance of the proposed method compared with the Unet and SVAE-Plain for different training set sizes.
Fig. 3. Examples of the optic cup segmentation produced by the proposed method. The red contour indicates the ground-truth cup region; the green contour indicates the predicted optic cup region

4 Conclusion

In this paper, we have presented a novel semi-supervised segmentation algorithm based on the variational autoencoder (VAE) to segment the optic cup in retinal fundus images. The generative VAE was trained using a large number of unlabeled images. The segmentation VAE, which maps the image to the segmentation mask, was then trained using a limited number of labeled images by leveraging the feature embedding provided by the generative VAE. We have demonstrated the effectiveness of the proposed method on the challenging task of optic cup segmentation in retinal fundus images, showing that it improves segmentation performance when the number of labeled images is limited. Our approach is therefore useful in clinical applications where the availability of annotated images is limited. Although we have applied our approach to cup segmentation in fundus images, we believe that the method is equally applicable to other modalities.