
1 Introduction

Glaucoma is currently the second leading cause of irreversible blindness in the world [1]. It is commonly characterized by sustained or temporary elevation of intraocular pressure (IOP) and by defects in the visual field. Diagnosis of glaucoma depends on information from various clinical examinations, including visual field (VF) testing, optical coherence tomography (OCT) and fundus photography [1, 2]. Fundus photos are easy to capture and are frequently used in glaucoma screening, where localization of the optic cup and disc is the main clue for automated diagnosis [3, 4]. In clinical practice, VF is widely used as the gold standard to judge whether patients have typical glaucomatous damage. Specific patterns of defects in the visual field, such as nasal step and arcuate scotoma, indicate the presence of glaucoma [5, 6].

Researchers have developed several algorithms based on data from clinical studies, such as the Advanced Glaucoma Intervention Study (AGIS) criteria and the Glaucoma Staging System (GSS) criteria, to grade glaucomatous VFs [5, 7, 8, 9]. However, it is hard to diagnose glaucoma from VF alone: in early-stage glaucoma, there may be no obvious VF defect even when the retinal nerve fiber layer (RNFL) has already been damaged. New algorithms for glaucoma diagnosis are therefore needed. We designed this study to investigate how well a deep neural network distinguishes glaucomatous VFs from non-glaucomatous VFs, and to compare its performance against that of human ophthalmologists.

2 Methods

2.1 Data Preparation

The study was approved by the Ethical Review Committee of the Zhongshan Ophthalmic Center and was conducted in accordance with the Declaration of Helsinki for research involving human subjects. The study has been registered at clinicaltrials.gov (NCT: 03268031). All visual fields (VFs) were obtained with either the 30-2 or the 24-2 test of the Humphrey Field Analyzer. To guarantee reliability, only VFs with fixation losses of less than 2/13 and with false positive and false negative rates of less than 15% were included in the experiments. Representative examples of non-glaucoma and glaucoma pattern deviation (PD) plots are shown in Fig. 1.

Fig. 1.

Representative examples of pattern deviation figures in glaucomatous and non-glaucomatous visual fields.

The probability map of pattern deviation (PD image) was then cropped from the VF report and resized to \( 224 \times 224 \) as the input of a deep CNN. All VFs from both eyes of a single patient were assigned to either the training set or the validation set to avoid data leakage. In this way, we split a total of 4012 PD images into two sets: 3712 for training and 300 for validation. For data augmentation, we randomly flipped the PD images in the training set horizontally to obtain a final set of 7424 training samples. Cross validation was performed by randomly splitting the training and validation sets 3 times, and no significant difference was observed. The validation set consisted of 150 glaucomatous PD images and 150 non-glaucomatous PD images. The non-glaucomatous PD images include 50 images with only cataract and 150 images with no ocular disease, retinal diseases or neuro-ophthalmic diseases.
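The patient-level split and the flip augmentation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the `(patient_id, image)` record layout and the choice to size the validation set by number of patients are assumptions made for illustration, and the sketch assumes each training image contributes exactly one mirrored copy, which matches the doubling from 3712 to 7424 samples.

```python
import random


def split_by_patient(records, n_val_patients, seed=0):
    """Split (patient_id, image) records so no patient appears in both sets."""
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    val_ids = set(patient_ids[:n_val_patients])
    train = [rec for rec in records if rec[0] not in val_ids]
    val = [rec for rec in records if rec[0] in val_ids]
    return train, val


def hflip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def augment_with_flips(train):
    """Return the original records plus one mirrored copy of each image."""
    return train + [(pid, hflip(img)) for pid, img in train]
```

Because the split is made on patient identifiers rather than on images, both eyes and all repeat tests of one patient always land on the same side of the split.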

2.2 Diagnostic Criteria of Glaucoma

Glaucoma was diagnosed using criteria similar to those of the UKGTS study [10]. VFs of patients who had glaucomatous damage to the optic nerve head (ONH) and reproducible glaucomatous VF defects were included. A glaucomatous VF defect was defined as a reproducible reduction of sensitivity compared to the normative database in reliable tests at: (1) two or more contiguous locations with P < 0.01 loss or more, (2) three or more contiguous locations with P < 0.05 loss or more. ONH damage was defined as C/D ratio ≥ 0.7, thinning of RNFL or both, without a retinal or neurological cause of VF loss.

2.3 Deep CNN for Glaucoma Diagnosis

We adopted VGG [11, 12] as our network structure. The VGG network consists of 13 convolutional layers and 3 fully connected layers. We reduced the output dimension of the penultimate layer, fc7, from 4096 to 200, and modified the last layer to output a two-dimensional vector corresponding to the prediction scores for healthy VF and glaucoma VF. The network was first pre-trained on ImageNet [13], a large-scale natural image classification dataset, to initialize its parameters. We then modified the last two layers as described above and initialized their parameters by drawing from a Gaussian distribution. All parameters of the network were updated by stochastic gradient descent with the softmax cross-entropy loss. The network structure is shown in Fig. 2.
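For reference, the softmax cross-entropy loss named above can be written out for a two-score output as in the following minimal Python sketch. The function names and the convention that index 0 means healthy and index 1 means glaucoma are assumptions for illustration; this is not the authors' training code.

```python
import math


def softmax(scores):
    """Turn raw prediction scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(scores)[label])


# Scores for (healthy, glaucoma) on a VF whose true label is glaucoma (index 1).
loss_confident = cross_entropy([0.0, 4.0], label=1)
loss_wrong = cross_entropy([4.0, 0.0], label=1)
```

The loss is small when the score of the true class dominates and grows as the network favours the wrong class, which is the signal that gradient descent uses to update the parameters.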

Fig. 2.

VGG [11, 12] was adopted as our network structure. We modified the output dimension of the penultimate layer fc7 from 4096 to 200. The last layer was modified to output a two-dimensional vector corresponding to the prediction scores of healthy VF and glaucoma VF. The network was first pre-trained on a large-scale natural image classification dataset, ImageNet [13], to initialize its parameters. Then we modified the last two layers as mentioned above and initialized their parameters by drawing from a Gaussian distribution.

2.4 Comparison Between CNN-Based Algorithm and Human Ophthalmologists in Glaucoma Diagnosis

We compared the diagnostic accuracy of our deep neural network algorithm with that of ophthalmologists. We chose 9 ophthalmologists at 3 different levels (glaucoma experts: Professor YL-L, XC-D and SJ-F; attending ophthalmologists: Dr. T-S, WY-L and WY-Y; resident ophthalmologists: Dr. X-G, WJ-Z and YY-W) from 4 eye institutes (see details in acknowledgements). None of them participated in the current research. Attending ophthalmologists are doctors who have had clinical training in ophthalmology for at least 5 years, while resident ophthalmologists are doctors who have had clinical training in ophthalmology for 1–3 years. Ophthalmologists were shown the PD images alone and asked to assign one of five labels to each PD image, i.e., non-glaucoma, likely non-glaucoma, uncertain, likely glaucoma and glaucoma.

2.5 Traditional Methods for Glaucoma Diagnosis

As a comparison, we also evaluated several rule-based methods and traditional machine learning methods for glaucoma diagnosis.

Rule-based methods included AGIS and GSS methods. For AGIS, a VF is considered to be abnormal if three or more contiguous points in the TD plot are outside of normal limits [8]. GSS2 uses both MD and PSD values to classify VFs into 6 stages [9]. Only stage 0 is considered healthy and other stages are treated as glaucoma.
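The contiguity rule stated above (three or more contiguous points outside normal limits) can be sketched in Python as follows. This implements only the simplified rule as worded in this section, not the full AGIS scoring system. The grid layout, the function names and the use of 8-connectivity (diagonal neighbours count as contiguous) are assumptions for illustration.

```python
from collections import deque


def largest_abnormal_cluster(flags):
    """Size of the largest group of touching abnormal points.

    flags is a rectangular 2-D list; truthy entries mark test points
    outside normal limits, falsy entries mark normal or absent points.
    """
    rows, cols = len(flags), len(flags[0])
    seen = set()
    best = 0
    for r in range(rows):
        for c in range(cols):
            if not flags[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill over the 8 neighbours of each point.
            seen.add((r, c))
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and flags[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            best = max(best, size)
    return best


def is_abnormal(flags, min_points=3):
    """Apply the 'min_points or more contiguous points' rule."""
    return largest_abnormal_cluster(flags) >= min_points
```

Three flagged points along a diagonal count as one cluster under this convention, whereas three isolated flagged points do not trigger the rule.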

We also compared our method with three non-deep machine learning algorithms. Support Vector Machine (SVM) [14] maps training samples into a high-dimensional space in which the classes can be separated by a hyperplane with as wide a margin as possible. Random Forest (RF) [15] constructs a set of decision trees, and each sample is classified according to the number of training samples of different categories falling into the same leaf node. For the k-Nearest-Neighbor (k-NN) [16] method, a sample is classified as healthy or glaucoma by majority voting among its k nearest training samples. Throughout these experiments, we used the 52 PD values in VFs obtained with the 24-2 test. For the 30-2 test, the 22 outermost values were discarded so that both test types could be treated identically. We tuned all the algorithms to improve their performance, e.g., we tested whether to use Principal Component Analysis (PCA) for preprocessing, different kernel types in SVM, different numbers of trees in RF and various k values in k-NN.
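As an illustration of the simplest of these baselines, the following Python sketch classifies a vector of PD values by majority vote among its k nearest training vectors. The Euclidean distance, the function name and the toy two-value vectors in the usage lines are assumptions for illustration; the study used 52 PD values per VF, and its actual implementation and distance metric are not specified here.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote of its k nearest neighbours."""
    # Pair each training label with its Euclidean distance to the query.
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data with 2 values per sample instead of 52, for readability.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["healthy"] * 3 + ["glaucoma"] * 3
prediction = knn_predict(train_x, train_y, [0.2, 0.2], k=3)
```

An odd k avoids tied votes in a two-class problem, which is one reason k is usually tuned over odd values.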

3 Results

Baseline characteristics are shown in Table 1. We collected a total of 4012 VF reports, including glaucoma and non-glaucoma reports. To assess differences between the non-glaucoma group and the glaucoma group, we ran an unpaired t test for numerical data and a chi-square test for categorical data. There was no significant difference in the left eye to right eye ratio (P = 0.6211, chi-square test), while age (P = 0.0022, unpaired t test), VFI (P = 0.0001, unpaired t test), MD (P = 0.0039, unpaired t test) and PSD (P = 0.0001, unpaired t test) differed significantly between the groups.

Table 1. Baseline characteristics of participants

To evaluate the effectiveness of the algorithm for automatic diagnosis of glaucoma, we summarized the performance of the proposed algorithm in Table 2.

Table 2. Performance of the algorithm and the compared methods.

On the validation set of 300 VFs, our CNN-based algorithm achieved an accuracy of 0.876, while the specificity and sensitivity were 0.826 and 0.932, respectively. To compare the results of ophthalmologists with those of machines, we also developed software to collect evaluation results from ophthalmologists. Ophthalmologists were shown the PD images alone and asked to assign one of five labels to each image, i.e., non-glaucoma, likely non-glaucoma, uncertain, likely glaucoma and glaucoma. They were strongly advised not to choose the uncertain label. For final evaluation, the non-glaucoma and likely non-glaucoma labels were counted as normal, the likely glaucoma and glaucoma labels were counted as glaucoma, and the uncertain label was counted as a wrong answer. Although the ophthalmologists included three resident ophthalmologists, three attending ophthalmologists and three glaucoma experts, we did not observe significant differences among these three groups. The average accuracies were 0.607, 0.585 and 0.626 for resident ophthalmologists, attending ophthalmologists and glaucoma experts, respectively. There was, however, a large performance gap between the ophthalmologists and the CNN, which indicates that the CNN may have a strong ability to identify the complex patterns present in the PD images for glaucoma diagnosis. Two rule-based methods, AGIS and GSS2, were also compared in the experiment. Neither method achieved satisfactory results. Interestingly, all the ophthalmologists performed better than GSS2 and AGIS, indicating the importance of human experience in the decision-making process. Three traditional machine learning algorithms were also included in the experiments. SVM performed best among these machine learning methods, but still much worse than the CNN.
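The scoring rule for the five-label answers can be made concrete with a short Python sketch. The mapping follows the text above; the function name, the 0/1 encoding of the ground truth and the treatment of an uncertain answer as a miss within its true class when computing sensitivity and specificity are assumptions for illustration.

```python
# Map each reader label to a binary call; None marks the uncertain label.
LABEL_TO_BINARY = {
    "non-glaucoma": 0,
    "likely non-glaucoma": 0,
    "likely glaucoma": 1,
    "glaucoma": 1,
    "uncertain": None,
}


def score_reader(labels, truths):
    """Score one reader; truths use 1 for glaucoma and 0 for non-glaucoma."""
    true_pos = true_neg = 0
    for label, truth in zip(labels, truths):
        call = LABEL_TO_BINARY[label]
        if call is None:
            continue  # uncertain earns no credit, so it counts as wrong
        if call == 1 and truth == 1:
            true_pos += 1
        elif call == 0 and truth == 0:
            true_neg += 1
    n_pos = sum(truths)
    n_neg = len(truths) - n_pos
    return {
        "accuracy": (true_pos + true_neg) / len(truths),
        "sensitivity": true_pos / n_pos,
        "specificity": true_neg / n_neg,
    }


answers = ["glaucoma", "likely glaucoma", "uncertain",
           "non-glaucoma", "likely non-glaucoma", "glaucoma"]
result = score_reader(answers, [1, 1, 1, 0, 0, 0])
```

In this toy example the uncertain answer on a glaucomatous VF and the false alarm on a healthy VF each cost one correct answer, giving 4 correct out of 6.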

As shown in Fig. 3, we examined the receiver operating characteristic (ROC) curves of the CNN and the compared methods. Our algorithm achieved an AUC of 0.966 (95%CI, 0.948–0.985). It outperformed all the ophthalmologists, rule-based methods and traditional machine learning methods by a large margin.
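The AUC reported above has a simple interpretation: it is the probability that a randomly chosen glaucomatous VF receives a higher glaucoma score than a randomly chosen non-glaucomatous VF. The following Python sketch computes it directly from that definition. The function name and the toy scores are illustrative, and the sketch does not reproduce the confidence interval, whose estimation method is not described in this section.

```python
def auc(scores, truths):
    """Area under the ROC curve via pairwise comparison of scores.

    truths use 1 for glaucoma and 0 for non-glaucoma; ties count as half.
    """
    pos = [s for s, t in zip(scores, truths) if t == 1]
    neg = [s for s, t in zip(scores, truths) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two glaucoma scores (0.9, 0.2) against two non-glaucoma scores (0.8, 0.1).
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

With 150 positive and 150 negative validation images the pairwise loop compares only 22,500 pairs, so the quadratic cost is negligible at this scale.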

Fig. 3.

Performance of the CNN, ophthalmologists and traditional algorithms. Nine ophthalmologists participated in the evaluation of VFs. On the validation set of 300 VFs, the CNN achieved an accuracy of 0.876, while the specificity and sensitivity were 0.826 and 0.932, respectively. The average accuracies were 0.607, 0.585 and 0.626 for resident ophthalmologists, attending ophthalmologists and glaucoma experts, respectively. Neither AGIS nor GSS2 achieved satisfactory results. Three traditional machine learning algorithms were also included in the experiments. SVM performed best among these machine learning methods, but still much worse than the CNN. We also examined the receiver operating characteristic (ROC) curves of the CNN and the compared methods. The CNN achieved an AUC of 0.966 (95%CI, 0.948–0.985), outperforming all the ophthalmologists, rule-based methods and traditional machine learning methods by a large margin.

We also studied the relative validation set accuracy as a function of the number of images in the training set. Each reduced training set was randomly chosen as a subset of the original training set at rates of (5%, 10%, …, 100%), and each set includes all the images in the smaller subsets. As shown in Fig. 4, the performance improves little once the training set includes more than 3612 images.
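One straightforward way to build subsets in which each set includes all the images of the smaller ones is to shuffle the training set once and take prefixes of increasing length. The Python sketch below illustrates this. The function name, the integer-percent interface and the rounding down of subset sizes are assumptions for illustration, not a description of the authors' procedure.

```python
import random


def nested_subsets(items, percents=range(5, 101, 5), seed=0):
    """Return {percent: subset}, where every subset contains the smaller ones."""
    order = list(items)
    random.Random(seed).shuffle(order)  # one shuffle fixes the nesting order
    # Each subset is a prefix of the same shuffled list, so prefixes nest.
    return {p: order[: len(order) * p // 100] for p in percents}


subsets = nested_subsets(range(200))
```

Training one model per subset and plotting validation accuracy against subset size then gives a learning curve of the kind described above.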

Fig. 4.

We studied the relative validation set accuracy as a function of the number of images in the training set. The training set is randomly chosen as a subset of the original training set at rates of (5%, 10%, …, 100%). Each set includes all the images in the smaller subset. As shown in the figure, the performance does not improve too much after the training set includes more than 3712 images.

4 Discussion

Our study makes two contributions: (1) we designed a project to develop our algorithm for the diagnosis of glaucoma, consisting of 4 steps: data collection, model design, training strategy design and model validation; (2) we developed a deep learning-based method that can differentiate glaucoma from non-glaucoma based on VFs, and verified its efficacy in differentiating VFs and its advantage over human ophthalmologists. Our CNN-based approach achieved both higher sensitivity and higher specificity than traditional machine learning methods and than algorithms derived from clinical trials such as AGIS [8]. Applying the CNN to the interpretation of VFs, we found that the method is both sensitive and reliable. Although ophthalmologists performed better than AGIS and GSS2, the CNN-based algorithm was even better at recognizing patterns present in the PD images. Our results demonstrate the possibility of applying CNNs to assist in the screening and diagnosis of glaucoma.

We compared the performance of our CNN-based algorithm against human ophthalmologists of different levels. As expected, glaucoma experts achieved the highest accuracy in VF interpretation, although the difference was only about 4% and 2% compared to attending and resident doctors, respectively. With accumulation of clinical experience, doctors tend to have higher specificity but lower sensitivity. Because the doctors had only VFs as an accessory examination on which to base a diagnosis, their diagnostic ability was restricted, and they tended to be more careful about their decisions. The CNN, however, achieved the highest score in the test, with the highest sensitivity while keeping high specificity. In our second step, we compared the performance of our algorithm against 2 criteria summarized from clinical trials, AGIS and GSS2 [8, 9]. The AGIS and GSS2 criteria were built to evaluate the severity and staging of glaucoma based on VF. The VF is divided into different areas with different weights. These criteria, however, were based on regression analysis, so they are typically linear and do not perform well on complex VFs. In the last step, we compared the performance of our CNN-based algorithm with traditional machine learning methods, including RF, SVM and k-NN. A previous study used a feed-forward neural network (FNN) to detect preperimetric glaucoma and showed an overwhelming advantage over traditional machine learning methods [17]. Similar results were obtained in our study. This is because these traditional algorithms are all shallow models, which cannot extract representative features of the PD images.

It should be noted that this study had several limitations. First, we used only pattern deviation images as the input of the machine learning algorithms. Thus, preperimetric glaucoma may not be effectively detected by the machine. We do not consider a VF from a single cross-sectional test sufficient to help diagnose early-stage disease, which is why we did not try to differentiate preperimetric glaucoma in our study. In future studies, we plan to combine VF with OCT scans. With input from different imaging modalities, deep networks may be able to make more accurate diagnoses. Second, at the current stage, the program we developed can only tell glaucoma from non-glaucoma. Various diseases, such as neuro-ophthalmic diseases and cataract, may influence VFs. We hope to extend our deep models to diagnose more ocular diseases.

5 Conclusion

In glaucoma diagnosis based on VF, our CNN-based algorithm achieved higher accuracy than human ophthalmologists and traditional rules (AGIS and GSS2). The accuracy is 0.876, while the specificity and sensitivity are 0.826 and 0.932, respectively, indicating advantages of CNN-based algorithms over humans in the diagnosis of glaucoma from VFs. It could be a powerful tool for distinguishing glaucoma from non-glaucoma VFs, and may help the screening and diagnosis of glaucoma in the future.