1 Introduction

As one of the most important organs, the eyes are regarded as a key information input source in human-computer interaction, and gaze estimation is considered an important new interaction method. Because of its convenience and rapidity, gaze estimation has been widely researched in recent years. With the development of image and video processing technology, high-precision gaze estimation on monocular data has become achievable.

In general, gaze estimation methods can be roughly divided into model-based methods and interpolation-based methods. The former use an eyeball geometric model, image features and hardware parameters to calculate the gaze position. Although this kind of method has achieved good results in the literature [17], such systems tend to require at least two cameras and known hardware parameters. Even if the hardware cost and the complexity of calibration are not considered, the deviation of the gaze direction calculated through the model is above 4 degrees.

Unlike model-based methods, which need an accurate mathematical model as input, interpolation-based methods do not require a calibrated hardware setup or extensive information about the user. This kind of method uses a calibration process to construct a mapping between high-dimensional image features and the low-dimensional gaze space, and the mapping is then used to calculate the gaze position of a test image. Sugano et al. [8] introduced a method that establishes the mapping between the eye image and the gaze point through Gaussian process regression. They use a visual saliency map as the input feature and achieve an accuracy of 3.5 degrees. Villanueva et al. [9] established the mapping using the vector from the pupil center to two corneal reflection centers, and the system accuracy reached less than 4 degrees. Cerrolaza et al. [10] and Ramanauskas et al. [11] used similar methods to establish different types of mappings, reaching similar gaze estimation accuracy.

On the other hand, interpolation-based methods have their own disadvantages. The accuracy of gaze estimation is closely related to the number of training samples. Xu et al. [12] and Tan et al. [13] used more than 200 training samples to build the mapping between input features and gaze points. Obviously, such a long calibration process makes users fatigued and annoyed, so it cannot spread to commercial use or other applications.

In this paper, we propose a novel interpolation-based gaze estimation method. It uses a PCA + HOG feature as the input feature. The core idea is to find an optimized subset among all the training samples and to use \( \ell^{1} \)-minimization to reconstruct the test feature vector over this subset. The corresponding linear combination of the calibration gaze positions gives the initial gaze estimation result. We then construct a gaze compensation equation to correct the initial result, which compensates for the effect of head movement on the initial gaze estimation. Eventually, the gaze estimation result achieves good accuracy using only 33 calibration points.

2 Gaze Estimation Method

This paper obtains the mapping between input features and gaze position through input feature reconstruction. The selection of input features is very important, as it has a decisive effect on the accuracy of feature reconstruction. The HOG feature is robust to illumination changes and geometric deformation of the image, so this paper uses the HOG feature as the system input feature.

2.1 Feature Extraction

When the gaze position changes, the most intuitive change in the 2-D image is that the pupil moves within the eye. We use the face alignment method proposed by Ren et al. [14] to locate the left-eye and right-eye regions as the regions of interest (Fig. 1). In the HOG feature extraction, the whole eye image is regarded as one block, and each block is divided into 3*6 cells (Fig. 1). In this way, for each eye image we obtain a 162-D (3*6*9) feature vector.

Fig. 1. Feature vector extraction.
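As an illustration, the following is a minimal sketch of the per-eye HOG extraction described above. The paper does not specify an implementation; scikit-image's `hog` function is used here as one possible choice, and it assumes a grayscale eye crop whose height and width are divisible by 3 and 6, respectively.

```python
import numpy as np
from skimage.feature import hog

def eye_hog_feature(eye_crop):
    """Compute the 162-D HOG descriptor of one eye crop.

    The whole crop is treated as a single block divided into 3*6 cells
    with 9 orientation bins, giving 3*6*9 = 162 dimensions. Assumes a
    grayscale crop whose height/width are divisible by 3/6.
    """
    h, w = eye_crop.shape
    return hog(
        eye_crop,
        orientations=9,
        pixels_per_cell=(h // 3, w // 6),  # 3 rows * 6 columns of cells
        cells_per_block=(3, 6),            # one block covering the whole eye
        feature_vector=True,
    )                                      # shape: (162,)
```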

Such a large feature vector not only slows down feature reconstruction but also contains many useless feature dimensions that become noise during reconstruction. The main factors that reflect the essential image changes can be extracted from the high-dimensional feature vectors by PCA. As can be seen from Fig. 2, the first 10 dimensions contain 90 % of the information of the feature space. Therefore, we use PCA to reduce the feature vector from 162 to 10 dimensions and take the result as the system input feature.

Fig. 2. The percentages of eigenvalues in different feature dimensions.
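A minimal sketch of the PCA reduction from the 162-D HOG vectors to the 10-D system input feature, using scikit-learn as one possible implementation (the file name and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# hog_train: (n_samples, 162) HOG vectors collected during calibration
hog_train = np.load("calibration_hog.npy")  # hypothetical file name

pca = PCA(n_components=10)
feat_train = pca.fit_transform(hog_train)   # (n_samples, 10) training features

def project(hog_vec):
    """Apply the same projection to a test frame's 162-D HOG vector."""
    return pca.transform(hog_vec.reshape(1, -1))[0]  # (10,)

# the first 10 components should retain roughly 90 % of the variance (cf. Fig. 2)
print(pca.explained_variance_ratio_.sum())
```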

The proposed gaze tracking system runs in real time, so the size of the eye region does not change much between two consecutive frames. For each test frame, we use the previous frame to judge whether the region of interest is correct. If the height or width of the region of interest changes greatly, we skip the frame and test the next one, as sketched below.
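A sketch of this consistency check; the relative threshold is an illustrative assumption, since the paper does not give a value:

```python
def region_is_stable(prev_box, cur_box, tol=0.3):
    """Accept the current eye region only if its size is close to the
    previous frame's; otherwise the current frame is skipped.

    Boxes are (x, y, width, height); tol is a hypothetical relative threshold.
    """
    _, _, pw, ph = prev_box
    _, _, cw, ch = cur_box
    return abs(cw - pw) <= tol * pw and abs(ch - ph) <= tol * ph
```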

2.2 Initial Gaze Estimation

The feature vector reflects the changes of the 2-D eye image when gazing at different positions. Assume a training set of eye images consisting of the eye feature matrix \( E = \left[ {e_{1} ,e_{2} , \ldots ,e_{n} } \right] \in {\mathbb{R}}^{m*n} \) and the corresponding gaze position matrix \( P = \left[ {p_{1} ,p_{2} , \ldots ,p_{n} } \right] \in {\mathbb{R}}^{1*n} \). We hope to find a mapping from \( E \) to \( P \):

$$ P = AE $$
(1)

where \( A \in {\mathbb{R}}^{1*m} \) is the projection matrix. Obviously, if n > m, the system of equations is overdetermined, and we cannot find a mapping matrix that is exact for all training samples. However, if n < m, such an A can be found. So we need to choose an optimized subset \( E^{'} = [e_{1}^{'} ,e_{2}^{'} , \ldots ,e_{{n^{'} }}^{'} ] \) and \( P^{'} = [p_{1}^{'} ,p_{2}^{'} , \ldots ,p_{{n^{'} }}^{'} ] \) from the whole training set. A new mapping can then be found:

$$ P^{'} = A^{'} E^{'} $$
(2)

where \( A^{'} \) is the new projection matrix. We hope that every test sample (\( \hat{e} \), \( \hat{p} \)) can find a corresponding \( A^{'} \). Therefore, in constructing the mapping, only a few of the \( e_{i}^{'} \) weights are allowed to be nonzero. The problem of solving the mapping matrix A is thus transformed into selecting the fewest \( e_{i} \) from all training samples to construct the optimized subset. The optimized subset \( E^{'} \) should be close to the test image feature \( \hat{e} \), so that they share the same mapping relationship and there exists a set of reconstruction weights {\( w_{i} \)} that linearly reconstructs \( \hat{e} \):

$$ \hat{e} = \sum\nolimits_{i} {w_{i} e_{i}^{ '} } $$
(3)

Ideally, the optimized subset would contain the samples that are close to the test image in the gaze space. The training feature vector with the minimum Euclidean distance to the test feature vector is found by Eq. 4. The calibration point corresponding to this vector is marked as the main point, i.e., the calibration point closest to the real gaze position.

$$ d_{i} = ||\;\hat{e} - c*e_{i} \;||_{2} ,\;\;\;\;\;i_{main} = \mathop{\text{argmin}}\nolimits_{i} \;d_{i} $$
(4)

where \( \hat{e} \) is the feature vector of the test image, \( e_{i} \) is the ith training feature vector, \( c \) is a scale coefficient, and \( d_{i} \) is the Euclidean distance between them. The feature vectors corresponding to the main point and the six calibration points around it constitute the optimized subset \( \hat{E} = [e_{main} ,e_{main}^{1} , \ldots ,e_{main}^{6} ] \). If the main point lies on the edge of the screen, the optimized subset is formed from the main point and the existing calibration points around it.
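A sketch of the main-point search of Eq. 4 and the construction of the optimized subset. The neighbor lookup and the value of the coefficient c are illustrative assumptions; the paper only states that the main point and up to six surrounding calibration points are used.

```python
import numpy as np

def build_optimized_subset(e_test, E_train, P_train, neighbors, c=1.0):
    """Select the main point (Eq. 4) and its surrounding calibration points.

    E_train: (n, m) training features, P_train: (n, 2) gaze positions,
    neighbors: neighbors[i] lists the calibration points around point i
    that exist on the screen (fewer than six near the edges).
    """
    d = np.linalg.norm(e_test - c * E_train, axis=1)  # Euclidean distances d_i
    main = int(np.argmin(d))                          # index of the main point
    idx = [main] + list(neighbors[main])              # main point + its neighbors
    return E_train[idx].T, P_train[idx]               # E_hat (m, k), P_hat (k, 2)
```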

Reconstructing the weights \( w \) over the optimized subset is formulated as a sparse reconstruction problem, which can be solved by minimizing the \( \ell^{1} \) norm of \( w \) [15, 16]. Because of real noise, it may not be possible to represent the test sample exactly as a linear combination of the optimized subset. A small constant ε is introduced to express the maximum allowed Euclidean distance from \( \hat{E}w \) to the ground truth \( \hat{e} \). The reconstruction weights \( w \) can be obtained by:

$$ \hat{w} = \mathop{\text{argmin}}\nolimits_{w} \;||\;w\;||_{1} \;\;\;\;\;s.t.\;\;||\;\hat{E}w - \hat{e}\;||_{2} < \varepsilon $$
(5)

Lu et al. [17] demonstrated that using the same weights to estimate the gaze parameters is justified by locality, as the linear combinations in the subspaces spanned by {\( e_{j}^{'} \)} and {\( p_{j}^{'} \)} are equal. Finally, the test gaze position \( \hat{p} \) is calculated by:

$$ \hat{p} = \sum\nolimits_{i} {w_{i} p_{i} } $$
(6)
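A sketch of Eqs. 5 and 6, using CVXPY as one possible \( \ell^{1} \) solver; the tolerance ε and all names are illustrative, since the paper does not prescribe a particular solver or value.

```python
import cvxpy as cp
import numpy as np

def estimate_gaze(E_hat, P_hat, e_test, eps=0.05):
    """Solve Eq. 5 for the sparse weights w, then evaluate Eq. 6.

    E_hat: (m, k) optimized-subset features (one column per calibration point),
    P_hat: (k, 2) gaze positions of those calibration points,
    e_test: (m,) test feature vector, eps: reconstruction tolerance (arbitrary).
    """
    k = E_hat.shape[1]
    w = cp.Variable(k)
    problem = cp.Problem(cp.Minimize(cp.norm1(w)),
                         [cp.norm(E_hat @ w - e_test, 2) <= eps])
    problem.solve()
    weights = np.asarray(w.value)
    return weights @ P_hat  # initial gaze estimate p_hat (Eq. 6)
```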

2.3 Gaze Position Compensation

The influence of the vertical position of the ground-truth gaze point on the initial gaze results is shown in Fig. 3. In Fig. 3, the vertical component of the gaze estimation result tends toward the center of the screen. The vertical errors of test points near the screen edges are larger than those in the screen center (point height between 300 and 800). In addition, the error of the gaze estimation results is larger in the vertical direction than in the horizontal direction (Sect. 3.2). The main reasons for this phenomenon are: 1. Since the human visual field is wider horizontally than vertically, people move their heads more when changing the gaze point vertically. 2. When the gaze point changes in the horizontal direction, the eye image changes significantly; when the gaze point changes in the vertical direction, the eye image changes little.

Fig. 3. Initial gaze estimation result under different gaze positions.

The curves of the mean error and the eye opening size follow almost the same linear relation. Therefore, after the initial gaze position is obtained, a linear equation is used to compensate the vertical deviation. The final gaze estimation result in the vertical direction is calculated by:

$$ p_{yf} = \hat{p}_{y} + \frac{{S_{h} - S_{l} }}{{H_{h} - H_{l} }}*(H/2 - \hat{p}_{y} ) $$
(7)

where \( \hat{p}_{y} \) is the initial gaze estimate in the vertical direction, H is the total number of pixels along the vertical direction of the screen, \( S_{h} \) is the eye opening size when gazing at the highest training point on the screen, \( S_{l} \) is the eye opening size when gazing at the lowest training point, \( H_{h} \) is the vertical coordinate of the highest training point, and \( H_{l} \) is the vertical coordinate of the lowest training point.
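A direct transcription of Eq. 7 as a helper function (the parameter names are illustrative):

```python
def compensate_vertical(p_y_hat, S_h, S_l, H_h, H_l, H):
    """Apply the vertical gaze compensation of Eq. 7.

    p_y_hat: initial vertical gaze estimate (pixels)
    S_h, S_l: eye opening sizes at the highest / lowest training points
    H_h, H_l: vertical coordinates of those training points (pixels)
    H: screen height in pixels
    """
    return p_y_hat + (S_h - S_l) / (H_h - H_l) * (H / 2 - p_y_hat)
```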

3 Experiments

In this section, experiments are performed to evaluate the proposed gaze estimation system. Six male and four female subjects were chosen for the experiment, using one camera with a resolution of 1280*720 and two infrared light sources with 850 nm wavelength. We implemented our system with a 24-inch computer screen whose resolution is 1920*1080 pixels. Each subject was asked to sit at a distance of 600 mm from the screen. During the experiment, the subject's head was kept aimed at the screen center as much as possible, and slight head motion was allowed.

The whole experiment is divided into a calibration stage and a test stage. During calibration, the subject focused on each calibration point shown on the screen while the camera captured the frontal appearance. During the test, the subject watched the test points shown on the screen. There are 30 test points distributed over the whole screen, shown in random order.

3.1 Evaluation and Comparison

For each input image, the gaze positions of the left eye and the right eye are calculated separately, and the average of the two is regarded as the double-eye gaze position. To report the experimental results directly and compare with other state-of-the-art methods, the angular estimation error is calculated by:

$$ {\text{error}} \approx \arctan (||\hat{p} - p_{0} ||_{2} /D) $$
(8)

where \( ||\hat{p} - p_{0} ||_{2} \) denotes the Euclidean distance between the real 2D gaze position \( p_{0} \) and the estimated 2D gaze position \( \hat{p} \), and D indicates the distance between the subject's eye and the screen.
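A sketch of the error metric of Eq. 8. The pixel-to-millimetre conversion is an added assumption needed to keep the units consistent; for a 24-inch 1920*1080 screen the pixel pitch is roughly 0.28 mm.

```python
import numpy as np

def angular_error_deg(p_hat, p_0, D, pixel_pitch_mm=0.28):
    """Approximate angular error (Eq. 8) in degrees.

    p_hat, p_0: estimated and true on-screen gaze positions in pixels;
    D: eye-to-screen distance in mm; pixel_pitch_mm: assumed pixel size.
    """
    dist_mm = np.linalg.norm(np.asarray(p_hat) - np.asarray(p_0)) * pixel_pitch_mm
    return np.degrees(np.arctan(dist_mm / D))
```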

Table 1 shows the mean error of the gaze estimation system for each subject; the double-eye estimate gives the highest accuracy for every subject. In general, the single-eye gaze estimator achieves a mean error of 83 pixels, corresponding to an angular error of 2.15°, while the double-eye gaze estimator has a mean error of 69 pixels, corresponding to an angular error of 1.79°. For some subjects, the left eye's gaze estimation accuracy is higher than the right eye's, while for others it is the opposite. The overall average accuracies of the left and right eyes are basically the same. This demonstrates that which eye yields the more accurate result depends on individual differences, and that the double-eye estimation accuracy is always higher than the single-eye accuracy.

Table 1. Mean pixel error and mean angle error

In addition, we compare our system with other gaze estimation systems that do not use a head-fixing device. Comparison results are shown in Table 2; compared with other excellent methods of recent years, our method has better positioning accuracy in both single-eye and double-eye gaze estimation.

Table 2. Comparison with state-of-the-art methods

3.2 Compensation Equation Evaluation

To compensate for the error caused by slight head motion during gaze estimation, a gaze compensation algorithm is proposed in Sect. 2.3. This section evaluates the effect of the compensation algorithm on the final gaze estimation result.

As can be seen from Fig. 4, the mean pixel error of the initial double-eye gaze estimation in the y direction (mean error 61) is obviously larger than that in the x direction (mean error 37); the reason has been explained in Sect. 2.3. After applying the gaze compensation method, the mean pixel error in the y direction drops from 61 to 45, a decrease of about 26 %.

Fig. 4. Mean error in the x direction and y direction.

3.3 Distance Change

In the above experiments, the distance between the subjects and the screen was set to 600 mm. This section evaluates the robustness of the proposed method to changes in the distance between the subjects and the screen. We chose three subjects for the experiment; the test distances were set to 500 mm, 600 mm, 700 mm and 800 mm. The experiment process is exactly as described in Sect. 3.1.

As can be seen from the experimental results in Table 3, the overall gaze accuracies under different distances are basically equal. This shows that the gaze estimation method proposed in this paper is robust to distance changes.

Table 3. The mean error of gaze estimation under different distances

4 Conclusion and Future Work

In this paper, we have proposed an accurate gaze estimation method that requires only a small number of calibration points. First, the main point is found using the minimum Euclidean distance among all calibration feature vectors. The main point and the calibration points around it constitute the optimized subset. Within the optimized subset, a set of sparse reconstruction weights is obtained by \( \ell^{1} \)-minimization and used to linearly express the initial gaze estimation result. Based on the initial estimate, we use the gaze compensation equation to obtain the final gaze estimation result. Experiments show that our method achieves an accuracy of 1.79° under monocular vision.

However, limitations still exist. First, although our method achieves high-precision gaze estimation under slight head motion, it is still powerless under free head motion. Second, the calibration process is necessary for each subject; the mapping between input features and gaze position cannot be found without calibration. Overall, handling large head motion and completely removing the training stage are currently impossible in this work. In further research, the method could be improved by adding 3D head pose estimation and reconstructing the head motion compensation equation accordingly, so that a free-head-motion gaze estimator could be achieved.