1 Introduction

Cataracts are a clouding of the lens that reduces transmission of light to the retina. They may be caused by a variety of factors, including age, ultraviolet radiation, and genetics. This obstruction of light can seriously impair vision and may even progress into blindness [1]. Due to its prevalence particularly among the elderly, there is a need to screen for them in an efficient and cost-effective manner.

Most commonly, cataracts develop in the nucleus, which is the central layer of the lens. The opacification and coloration caused by nuclear cataracts is visible in cross-sectional views of the lens in slit-lamp images. Currently, nuclear cataracts are diagnosed by ophthalmologists directly using a slit-lamp microscope, or graded by clinicians who assess the presence and severity of a cataract by comparing slit-lamp images against a set of protocol photographs [24] such as that shown in Fig. 1. However, manual assessments can be both subjective and time-consuming [4].

The need for objectivity and efficiency in nuclear cataract grading has led to the development of several computer-aided systems [510]. These systems generally operate with three main steps: lens structure detection, feature extraction, and severity prediction. Most prior art formulate the severity prediction as a regression problem on certain visual features. In the state-of-the-art method of [5], bag-of-features (BOF) descriptors are extracted from RGB and HSV color channels of different lens sections, and group sparsity regression (GSR) is used to jointly select the features, parameters and models for grading. In [6], 21 pre-defined features are extracted from different sections of the lens, and then are fed into a pre-learned RBF kernel-based Support Vector Regressor (SVR) to estimate the cataract grade.

While regression-based methods have achieved higher accuracy than other previous techniques, we observe that the dense sampling of cataract grades in the available training data allows for a more direct grading prediction. When training samples are densely distributed in the feature space, higher accuracy can be achieved through reconstruction from the samples, where class membership of an input is estimated based on reconstruction accuracy from similar instances within each class. Compared to regression, a reconstruction-based approach is less reliant on discriminative feature quality and more robust to small inter-class margins, such as those that exist for the continuous space of cataract grades. These advantages of reconstruction-based classification have been exploited for human gait recognition [11], where a large number of training samples are densely distributed in a feature space that is compact due to the relatively narrow range of gait differences.

For nuclear cataract grading, however, the reconstruction-based approach is ineffective when employed in a straightforward manner. In our preliminary tests, linear reconstruction of lens images after alignment and size normalization of lens sections led to grading performance much lower than the state-of-the-art BOF+GSR method [5]. This is mainly due to inadequate lens alignment. Since lenses vary in both size and shape, a non-rigid structural alignment of lenses is needed for accurate reconstruction, but is challenging to accomplish. To address this problem, we propose to model the test and reference images with a new semantic representation of lens structure that is less sensitive to slight misalignments, instead of processing in the original raw image space. In addition, we improve the accuracy of reconstruction-based cataract grade prediction by accounting for the degree of similarity between the test image and reference images. By ignoring reference images that are not well-aligned to the test image as done in the similarity ranking-based CBMIR approach [8], the alignment issue is further diminished.

Our proposed method essentially follows the manual grading protocol, as it directly compares with reference images through the alignment and reconstruction procedure, and it compares intensity/color and contrast patterns via the semantic feature representation. With this approach, our system attains higher overall performance than the state-of-the-art, and has the potential to be applied to other ocular diseases such as angle closure glaucoma detection and optical cup localization.

Fig. 1.
figure 1

Standard photographs of the Wisconsin nuclear cataract grading system. From left to right, the severity of the nuclear cataracts increases, with greater brightness and lower contrast between anatomical landmarks. In addition, the color of the nucleus and posterior cortex exhibits more of a yellow tint due to brunescence.

2 Nuclear Cataract Grading Through Semantic Reconstruction

We formulate nuclear cataract grading as a linear reconstruction problem with a similarity-weighted constraint. For a given slit-lamp test image, our algorithm follows the steps of lens structure detection, semantic feature extraction, and linear reconstruction with reference images.

2.1 Lens Structure Detection

Detection of lens structures in slit-lamp images is a well-studied problem with effective solutions [510]. For this purpose, we employ techniques similar to those used in [5, 6]. As illustrated in Fig. 2, the lens structure detection proceeds with the following steps:

  1. 1.

    Using the active shape model based lens structure detection proposed in [6], each lens image is separated into three sections: anterior cortex, nucleus, and posterior cortex. The visual axis is located as well.

  2. 2.

    A lens cross-section is extracted around the visual axis, using a bounding box with a height of h pixels (\(h=128\) in our implementation) to obtain the central parts of the nucleus, anterior cortex and posterior cortex.

  3. 3.

    Features are extracted from only the nucleus and posterior cortex sections, since the anterior cortex contains no discriminant information for nuclear cataract grading [5]. This practice is also supported by clinical protocol [4], where nuclear cataracts are graded based on the intensity and visibility of nuclear landmarks and the color of the nucleus and posterior cortex.

  4. 4.

    As illustrated on the right side of Fig. 2, bright spots may appear in the extracted posterior cortex section, due to reflections of the photographic flash. The presence of these sharp reflections may greatly reduce grading accuracy, so we avoid them by simply shifting the bounding box of the posterior cortex vertically with a step size of h / 2 until it has a mean scaled value lower than a threshold value \(\theta _p\), where \(\theta _p = 192\) in our implementation.

Fig. 2.
figure 2

Left: illustration of lens structure detection, with the initially detected lens structure (solid yellow boxes) and final detected posterior cortex (dashed yellow box). Right: examples of detected lens cross-sections. On the left are initially detected cross-sections with sharp reflections in the posterior cortex. On the right are cross-sections that were detected without the posterior cortex reflections. Since the anterior cortex will be discarded, reflections there need not be avoided. (Color figure online)

2.2 Semantic Feature Representation

After detection, the posterior cortex is divided into \(s \times s\) (\(s=3\) in our implementation) half-overlapping grid cells, and the nucleus is partitioned into \(s \times 2s\) half-overlapping grid cells. For each of the RGB and HSV color channelsFootnote 1, a grid cell is represented by its mean intensity \(\hat{t}\) and entropy e, defined as \(e=-\sum _{l=0}^{255}{p_l\log {p_l}}\) where \(p_l\) is the probability of intensity l in the grid for a given color channel. With this data, each image is represented by a feature vector with \(3 \times s \times s \times 6 \times 2 = 36s^2\) dimensions.

With the downsampling and half-overlapping grid cells, this representation becomes less sensitive to slight misalignments caused by differences in lens shape. We note that the feature vectors used in previous works such as [6], though containing features such as intensity ratios and edge strength that are useful for discriminative classification, are less suitable for dealing with the alignment problem, which is critical to the success of reconstruction-based techniques.

2.3 Similarity Weighted Linear Reconstruction (SWLR)

Suppose we have a dictionary that consists of n reference images, denoted by \(\mathbf {{D}}=\{\mathbf {d_1,d_2,\cdots ,}\) \(\mathbf {d_n}\}\in \mathbb {R}^{f\times n}\) where each column \(d_i\) denotes a reference image expressed by its semantic feature vector. For a given test image expressed as \(\mathbf {y}\in \mathbb {R}^{f\times 1}\), we compute the optimal linear reconstruction coefficients \(\mathbf {w}\in \mathbb {R}^{n\times 1}\), \(\sum _{i=1}^{n}{w_i}=1\), \( w_i \ge 0\), that minimize the reconstruction error \(||\mathbf {y-Dw}||^2\). Our objective function also includes a cost term that penalizes the use of references that are less similar to the test image. Let us denote the costs for the reference images in \(\mathbf {D}\) as the vector \(\mathbf {c}=\{c_1,c_2,\cdots ,c_n\}^{\top }\in \mathbb {R}^{n\times 1}\), where \(c_i\) is the cost of using \(\mathbf {d_i}\) for reconstruction. The overall cost term can then be expressed as \(||\mathbf {c \odot w}||^2\) where \(\odot \) denotes the Hadamard product. Combining this cost term with the reconstruction error gives the following objective function:

$$\begin{aligned} \begin{aligned} \min _{\mathbf {w}} {||\mathbf {y-Dw}||^2+\lambda ||\mathbf {\mathbf {c \odot w}||^2}}, \;\;\;\; s.t.\; \sum _{i=1}^{n}{w_i}=1,w_i \ge 0, \end{aligned} \end{aligned}$$
(1)

where \(\lambda >0\) is a regularization parameter. This objective can be minimized in closed form using the Lagrange multiplier method:

$$\begin{aligned} \begin{aligned} \mathbf {w}=&\frac{1}{\mathbf {1}^{\top }(\hat{\mathbf {D}}^{\top }\hat{\mathbf {D}} + \lambda \mathbf {C}^{\top } \mathbf {C})\mathbf {1}}(\hat{\mathbf {D}}^{\top }\hat{\mathbf {D}} + \lambda \mathbf {C}^{\top } \mathbf {C})^{-1} \mathbf {1}, \\ \hat{\mathbf {D}} =&(\mathbf {1 \otimes y - D}), \end{aligned} \end{aligned}$$
(2)

where \(\mathbf {C}=\text {diag}(\mathbf {c})\) and \(\otimes \) denotes the Kronecker product. The cost \(c_i\) is defined as the \(\chi ^2\)-distance between the test image \(\mathbf {y}\) and the i-th reference image \(\mathbf {d_i}\), i.e.,

$$\begin{aligned} c_i=\sum _{j=1}^{f}{\frac{(y_j-d_{i,j})^2}{2(y_j+d_{i,j})}}, \end{aligned}$$
(3)

where \(d_{i,j}\) denotes the j-th entry of reference image \(\mathbf {d_i}\). We note that the inclusion of entropy in the semantic feature helps to exclude misaligned references in the SWLR algorithm, since high entropy indicates the presence of structural variations, and differences in entropy caused by misalignment are penalized by this cost function.

Finally, the test image is graded as \(\mathbf {w}^{\top }\mathbf {g}\), where \(\mathbf {g}\) denotes the corresponding cataract grades of reference images in dictionary \(\mathbf {D}\).

3 Experiments

To evaluate our method, we first compare it to the state-of-the-art nuclear cataract grading methods [5, 6]. We then validate its major components by comparing to versions of our technique with certain components replaced.

3.1 Experimental Setting

Dataset. All the experiments are performed on the large ACHIKO-NC dataset used in [5, 6], which is comprised of 5378 images with decimal grading scores that range from 0.3 to 5.0. The scores are determined by professional graders based on the Wisconsin protocol [4]. The protocol takes the ceiling of each decimal grading score as the integral grading score, i.e., a cataract with a decimal grading score of 2.4 has an integral grading score of 3. ACHIKO-NC consists of 94 images of integral grade 1, 1874 images of integral grade 2, 2476 images of integral grade 3, 897 images of integral grade 4, and 37 images of integral grade 5. All left eye images are flipped horizontally so that they can be processed in the same way as right eye images.

Evaluation Criteria. For a fair comparison to prior art, we measure grading accuracy using the same four evaluation criteria as in [5, 6], namely the exact integral agreement ratio (\(R_0\)), the percentage of decimal grading errors \(\le 0.5\) (\(R_{e0.5}\)), the percentage of decimal grading errors \(\le 1.0\) (\(R_{e1.0}\)), and the mean absolute error (\(\varepsilon \)), which are defined as

(4)

where \(G_{gt}\) denotes the ground-truth clinical grade, \(G_{pr}\) denotes the predicted grade, \(\lceil \cdot \rceil \) is the ceiling function, \(|\cdot |\) denotes the absolute value, \(|\cdot |_0\) is a function that counts the number of non-zero values, and N is the number of testing images (\(N=|G_{gt}|_0=|G_{pr}|_0\)). \(R_{e0.5}\) has the most narrow tolerance among the four evaluation criteria, which makes it more significant in evaluating the accuracy of grading.

Testing Method. To examine generalization ability, we follow the repeated test settings in [5], i.e., in each round, 100 training samples are randomly selected from all the 5378 images, with 20 images for each grade, and the remaining 5278 images are used for testing. In training, optimal parameters are selected for each method by cross-validation, where half of the images (50 images with 10 per grade) are used as the dictionary, the other half used for testing, and the set of parameters with the smallest average \(\varepsilon \) is chosen. The result of each round is obtained by testing the remaining 5278 images using all the 100 images as the dictionary together with the determined optimal parameters.

Table 1. Cataract grading performance vs. state-of-the-art regression based methods

3.2 Comparison to State-of-the-art Regression Methods

We first compare our method to the state-of-the-art techniques, namely BOF+GSR [5] and RBF \(\epsilon \)-\(\varvec{SVR}\) [6], using the same dataset, experimental setting and reporting methods. The results are listed in Table 1, where SF refers to the proposed semantic feature and SWLR refers to the proposed similarity weighted linear reconstruction method. According to the results, our method is shown to surpass [5, 6] in all four evaluation criteria.

In Table 1, our method is also compared to the application of RBF \(\epsilon \)-\(\varvec{SVR}\) on the proposed semantic feature. The results show that the proposed feature has less discriminative power than the features used in [5, 6]. This is not unexpected, since our semantic feature is designed for more robust alignment rather than discriminative power. In addition, our method has an extra advantage over [6] in that a more detailed segmentation of the lens is not needed for extracting discriminative features.

In summary, reconstruction and regression are two significantly different approaches to solve the cataract grading problem. The proposed SWLR selects more relevant sample images for each individual testing image to perform grading, while GSR selects more discriminative feature vector entries over all training images and assumes that good prediction can be obtained with these features on all the test images.

3.3 Comparison to Alternative Versions of Our Method

To validate the components of our technique, we compare it to alternative versions without the similarity based regularizer (referred to as LR), and by applying SWLR on different feature sets. The results are given in Table 2, and the following observations can be made:

  • Comparing SWLR to LR shows that the similarity constraint is helpful for selecting more relevant/representative reference images for each individual test image. With better reconstruction, the performance is improved. We note that applying LR on only the k-nearest neighbours (k-NN) also does not yield performance as high as SWLR, since k cannot be fixed to a value that is suitable for all test images. By contrast, SWLR can adaptively determine a set of proper reference images to reconstruct each individual test image.

  • Comparing SWLR using different feature sets shows that though some features may have greater discriminative power, they are less suitable in the context of reconstruction-based classification.

Table 2. Cataract grading performance using different reconstruction techniques and features

3.4 Discussion

Similarity Metric. We compared our SWLR method using different similarity metrics, namely \(\chi ^2\) distance and Gaussian distance, defined as \(\exp (||\mathbf {y-d_i}||^2 / {\sigma }^2)\) where \(\sigma \) is a parameter that accounts for imaging noise. It was observed that \(\chi ^2\) distance is more effective than Gaussian distance for the proposed feature representation, with metrics of (\(R_0\), \(R_{e0.5}\), \(R_{e1.0}\), \(\varepsilon \)) for \(\chi ^2\) distance being (0.696±0.008, 0.871±0.007, 0.991±0.001, 0.332±0.006) and for Gaussian distance being (0.688±0.009, 0.868±0.007, 0.990±0.001, 0.337±0.007).

Processing Speed. On a four-core \(2.4\,GHz\) PC with \(16\,GB\) RAM, our method takes 17.73 s on average to process an image, with 1.36 s for feature extraction and only 0.001 s for prediction because of the small dictionary size. This processing speed slightly exceeds the 20.45 s per image of [5], which takes 4.23 s for feature extraction and 0.00001 s for prediction. It is also faster than the 25.00 s per image of [6], which spends 8.76 s for feature extraction and 0.02 s for prediction.

4 Conclusion

For grading the severity of nuclear cataracts from slit-lamp lens images, we proposed a reconstruction-based approach with a new semantic feature representation and a similarity weighted regularizer. In tests on the ACHIKO-NC dataset comprised of 5378 images, our approach achieves significant improvements over the state-of-the-art regression based methods [5, 6]. In future work, we plan to elevate performance by introducing a feature selection mechanism and investigating other similarity metrics.