1 Introduction

Gaze-based interaction has been a highlight in the field of Human-Computer Interaction. As the fundamental technology behind gaze-based interaction, gaze estimation has attracted considerable research in recent years. Existing gaze estimation methods fall mainly into two categories, i.e., model-based methods [1,2,3,4,5,6,7,8] and appearance-based methods [10,11,12,13,14,15,16,17,18].

Most model-based gaze estimation methods rely on high-resolution cameras and infrared (IR) lights [5, 6]. These methods establish the eyeball geometry from the location of the pupil center and the reflections of the IR lights on the cornea, and estimate the gaze points according to the positional relationship between the IR lights and the cameras [7, 8]. Model-based methods can reach high accuracy, but most of them require a per-user calibration process before estimation, which harms the naturalness of HCI and degrades the quality of experience.

Different from the model-based methods, appearance-based gaze estimation methods are data-driven. They estimate the gaze point through a mapping between gaze points and the corresponding eye images or their features, learned from training eye images labelled with their gaze directions. Compared to the model-based methods, the appearance-based methods need no calibration and have low hardware requirements, which makes them promising in practical applications. Lu et al. [9, 10] divided an eye image into sub-regions and adopted intensity feature vectors as the extracted feature. Sugano et al. [11] proposed an appearance-based gaze estimation method using visual saliency. The most commonly used mapping functions include Gaussian Bayesian regression [12], adaptive clustering [13], random forest regression [14, 15], neural networks [16,17,18], and so on. Most of the above methods estimate the exact gaze points via a regression model. According to the performance reported in previous research, the accuracy of appearance-based methods in estimating the exact gaze point is much lower than that of model-based ones, even with the support of deep learning and large amounts of training data.

In practical interaction applications, the button-based touch interface is by far the most popular. Given this interaction pattern and the challenge of estimating the exact gaze point, we propose a novel appearance-based gaze estimation method. Our contributions in this paper are:

  1. We estimate gaze blocks rather than the exact positions of fixation points. During button-based touch interaction, the user usually tends to touch the center of the button, although a touch anywhere on the button triggers the operation. Similarly, the user tends to gaze at the center of a button when trying to trigger it by gaze. This means that all gazes falling into one button can be treated as gazes that trigger this button. Therefore, taking practical interaction applications into account, we broaden the gaze estimation task from point-wise estimation to block-wise estimation.

  2. We treat the mapping between eye images and gaze blocks as a classification task, implemented by a CNN-based classifier. Because all gazes falling into the same button can trigger it, these gazes can be considered one class in the case of button-based interaction, and the required accuracy can be relaxed to the size of the button. The classification-based estimation balances the requirements of practical applications against the difficulty of accurate estimation.

  3. We use binocular images as the training data, because binocular images provide more information about head pose and the relative position of the two eyes than monocular images, which is considered useful for accurate gaze estimation. This differs from most existing methods, which use left or right eye images independently, or process them separately and fuse the results at the end. In addition, different users have different dominant eyes, so the estimation accuracy on monocular images of the dominant eye is higher than on the non-dominant eye, while estimation based on binocular images can achieve accurate and robust results.

We perform block-wise gaze estimation at the 6- and 54-block levels on our collected data and the MPIIGaze dataset. The experiments show that the proposed method outperforms existing approaches in gaze block estimation.

2 Proposed Method

2.1 Framework

As mentioned above, this paper is dedicated to appearance-based gaze estimation via CNN. Figure 1 shows the basic framework of our binocular data-driven method. Like most appearance-based methods, the learning-based gaze estimation method first requires a large amount of data for model training. For this purpose, we establish our binocular dataset as detailed below. After data collection, we take the binocular images as the input of the CNN.

Fig. 1. Framework of the proposed gaze estimation method.

2.2 Binocular Image Collection

There are many publicly available human eye datasets. However, most of them provide only monocular data and are designed for regression problems. Unlike these datasets, we build our own dataset as described under Data Collection below, label all eye images with their corresponding block, and treat the eye images with the same label as the same category.

Data Collection.

Our experimental images are captured with a single web camera, a Logitech C270 with a resolution of 640 × 480, and a standard 19-inch screen with a 16:9 aspect ratio. The screen blocks are shown in Fig. 1. We first divide the screen into 2 × 3 blocks (thick lines), and each block is further divided into 3 × 3 small blocks (thin lines) according to the screen ratio. The subjects are asked to sit about 60 cm away from the screen with their heads free, which ensures that one centimeter on the screen corresponds approximately to one degree of gaze direction. The large blocks measure 12.75 × 12.75 cm, so the corresponding 6-class classification achieves an accuracy of ±6.38°; the small blocks measure 4.25 × 4.25 cm, so the corresponding 54-class classification achieves an accuracy of ±2.13°.
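
For concreteness, the relation between block size and angular accuracy can be checked with a short Python sketch; the 1 cm ≈ 1° small-angle approximation is the one used above, and the exact arctangent value is shown for comparison.

```python
import math

VIEW_DIST_CM = 60.0    # approximate viewing distance during data collection

def half_angle_deg(block_cm, dist_cm=VIEW_DIST_CM, small_angle=True):
    """Half the visual angle subtended by one block edge, in degrees.

    With the 1 cm ~ 1 degree approximation at 60 cm this is simply
    block_cm / 2; set small_angle=False for the exact arctangent value.
    """
    if small_angle:
        return block_cm / 2.0
    return math.degrees(math.atan(block_cm / 2.0 / dist_cm))

print(half_angle_deg(12.75))                      # 6.375  -> +/-6.38 deg (6-class)
print(half_angle_deg(4.25))                       # 2.125  -> +/-2.13 deg (54-class)
print(half_angle_deg(12.75, small_angle=False))   # ~6.06 deg, exact value
```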

We adopt a simple data collection procedure. Each subject is asked to gaze at the center of the blocks on the screen, and the block center is taken as the ground truth of the fixation points falling in that block. Gazes that fixate any point within a block are assigned to the same category in our method. In total, 56 groups of eye videos are collected from 22 subjects aged 20–30, and each video lasts about 4 min at 15 fps. To prevent fixation fatigue, we set a two-second rest between every two fixation points.

Binocular Images.

Figure 1 shows the acquisition of binocular data. We adopted the face and eye detection method in [19]. Given a fixation video of a subject, we first detected the face of the first frame by Haar-like features [20]. And then, we detected all remaining images by matching with the detected face from the first frame. For the obtained human face area, we used the empirical theory namely the general positions of the two eyes in the face area to obtain each eye’s center. The precise locations of the two eyes center can be further determined based on the human face. The binocular rectangle can be obtained by setting the threshold. All extracted binocular images contain both left and right eyes and are normalized to the fixed size of 40 × 184 according to the average size of the eye image samples. We randomly selected the 151200 training images and 30240 testing images to train the CNN model. The eye images in training set and testing set remain mutually disjoint.
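
As a rough illustration of this extraction step, the following sketch uses OpenCV's stock Haar cascade for face detection; the cascade file and the eye-band proportions are assumptions standing in for the detectors of [19, 20], not the exact pipeline used in the paper.

```python
import cv2

# Stock OpenCV Haar cascade; the original pipeline follows [19, 20] and may differ.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_binocular(frame, out_size=(184, 40)):
    """Detect the largest face and crop a fixed-size strip covering both eyes.

    The eye-band proportions (rows 0.20-0.55, cols 0.10-0.90 of the face box)
    are illustrative guesses standing in for the empirical priors in the paper.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    band = frame[y + int(0.20 * h): y + int(0.55 * h),
                 x + int(0.10 * w): x + int(0.90 * w)]
    return cv2.resize(band, out_size)                    # (width, height) = (184, 40)
```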

2.3 Gaze Estimation Using CNN

Although many appearance-based gaze estimation methods based on regression models have been proposed, it is difficult for them to achieve high accuracy, because estimating the exact gaze position is a genuinely hard task. Our block-wise, classification-based gaze estimation method relaxes the estimation target from a point to a block, which reduces the difficulty of training a learning-based model. Besides, estimating gaze blocks is exactly what button-based touch interaction requires.

Considering the good performance of the CNN model used in [21] for image classification in the ImageNet competition, we fine-tune the parameters of this network to suit our input images. Figure 2 illustrates the structure of our CNN model. The network contains three convolutional layers, each followed by a max pooling layer, and two fully connected layers that map the extracted features into a multidimensional vector. The last layer determines the final category of the input image by computing the probability of each class; the category with the highest probability is the predicted one. The cuboids in Fig. 2 represent the convolutional and max pooling layers; the length, width, and height of each cuboid denote the number of feature maps and the size of each map, and the size of the convolution kernel is illustrated by the small squares in each cuboid. The two fully connected layers are represented by rectangular bars of dimension 256, and the final layer represents the number of classes. In this paper, we use two classification criteria, namely 6-class and 54-class classification, which correspond to the two block sizes.

Fig. 2. The structure of the deep convolutional neural network.
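
A minimal PyTorch sketch of a network in the spirit of Fig. 2 is given below. The feature-map counts (40, 80, 128) and the 256-dimensional fully connected layers follow Figs. 2 and 3; the kernel sizes, padding, and the use of LazyLinear to infer the flattened feature size are assumptions, since the paper does not specify them all.

```python
import torch
import torch.nn as nn

class GazeBlockNet(nn.Module):
    """Three conv + pool stages and two 256-d fully connected layers (cf. Fig. 2).

    Channel counts (40, 80, 128) follow the pooling outputs shown in Fig. 3;
    kernel sizes and padding are assumptions made for this sketch.
    """
    def __init__(self, n_classes=54):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 40, kernel_size=5), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(40, 80, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(80, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):                 # x: (N, 3, 40, 184) binocular images
        return self.classifier(self.features(x))

logits = GazeBlockNet(n_classes=6)(torch.randn(2, 3, 40, 184))
print(logits.shape)                       # torch.Size([2, 6])
```

With a 3 × 40 × 184 input, the first and third pooling outputs of this sketch reproduce the 40 × 18 × 90 and 128 × 4 × 22 shapes reported in Fig. 3.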

2.4 CNN-Based Classification

We choose RGB images as the input, since the color information is useful for improving accuracy. We define \( x_i^{l-1} \) as the \( i \)th feature map of the \( (l-1) \)th layer and \( M \) as the number of feature maps in this layer. The output of the \( l \)th convolutional layer can then be expressed as:

$$ x_j^l = \mathrm{ReLU}\Big(\sum\limits_{i \in M} x_i^{l-1} * k_{ij}^l + b_j^l\Big) $$
(1)

where \( k_{ij}^l \) is the convolution kernel, \( b_j^l \) is the bias, and “\( * \)” denotes the convolution operator. The previous feature maps are convolved with different convolution kernels, shifted by a bias, and passed through the activation function; each result forms one of the feature maps of the convolutional layer.

The output of the current sub-sampling layer can be expressed as:

$$ x_j^l = f\big(\beta_j^l \cdot \mathrm{down}(x_j^{l-1}) + b_j^l\big) $$
(2)

where \( \mathrm{down}(\cdot) \) represents the max-pooling operation. The pooling result is multiplied by a gain coefficient \( \beta_j^l \) and shifted by a bias, followed by the activation function \( f \).
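
Equations (1) and (2) can be written directly with functional convolution and pooling operations; the tensor shapes and kernel size in the sketch below are illustrative only.

```python
import torch
import torch.nn.functional as F

# Eq. (1): convolve the previous feature maps, add a bias, apply ReLU.
x_prev = torch.randn(1, 40, 18, 90)        # M = 40 feature maps from layer l-1
kernels = torch.randn(80, 40, 3, 3)        # k_ij: one 3x3 kernel per (i, j) pair
bias = torch.randn(80)                     # b_j
x_conv = F.relu(F.conv2d(x_prev, kernels, bias, padding=1))

# Eq. (2): max pooling (down), scaled by a gain beta_j, shifted by a bias b_j,
# then passed through the activation f (ReLU here).
beta = torch.randn(80).view(1, -1, 1, 1)
b = torch.randn(80).view(1, -1, 1, 1)
x_pool = F.relu(beta * F.max_pool2d(x_conv, kernel_size=2) + b)
print(x_pool.shape)                        # torch.Size([1, 80, 9, 45])
```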

We use Rectified Linear Units (ReLU) as the activation function and add a Local Response Normalization (LRN) layer between the convolutional and pooling layers to improve performance. We also adopt a dropout layer to prevent over-fitting.

The convolutional layers, pooling layers, and activation functions map the original input into a hidden feature space. After feature extraction, we use the fully connected layers to perform the classification. The last two fully connected layers are implemented as inner products, and a 256-dimensional feature vector is generated in our network. In the final classification step, the probabilities of the different classes are computed, and the category with the highest probability is taken as the one the input image belongs to. When a test image passes through the trained network, the outputs of the three pooling layers are as shown in Fig. 3. These images indicate that the contours of the input image are gradually blurred and deeper features are gradually extracted as the network grows deeper.
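
The decision step described above reduces to a softmax over the class scores followed by an argmax; a minimal sketch (the model argument being any trained classifier such as the one sketched in Sect. 2.3):

```python
import torch
import torch.nn.functional as F

def predict_block(model, images):
    """Return the index of the most probable gaze block for each input image."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(images), dim=1)   # per-class probabilities
    return probs.argmax(dim=1)                    # predicted block label
```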

Fig. 3. Outputs of the different pooling layers in the CNN: the 1st pooling layer (40 × 18 × 90, top left), the 2nd pooling layer (80 × 10 × 46, top right), and the 3rd pooling layer (128 × 4 × 22, bottom).

3 Experiments and Evaluations

3.1 Gaze Estimation Performance

In this section, we present the results of the proposed binocular-image-based method. As mentioned above, we randomly split the data into training and testing sets with a 5:1 sample ratio, and the two sets are completely disjoint. We use binocular data, left eyes only, and right eyes only as the training data of the network, respectively, and all experiments are repeated in both the 6-class and 54-class scenarios. With binocular data, the proposed method reaches an average accuracy of 98.52% for 6-class classification and 90.97% for 54-class classification. With left eyes only, the accuracy reaches 93.31% for 6-class and 80.89% for 54-class classification; with right eyes only, 92.74% and 79.42%, respectively. The average classification accuracies are shown in Fig. 4, where the rising curves show how accuracy changes with iterations for the different types of input. It can be seen that binocular images improve accuracy by about 5% in the 6-block case and about 10% in the 54-block case, which means that binocular images are helpful for accurate gaze estimation.

Fig. 4. Experimental results of both 6-class and 54-class average classification accuracy.

We also give the confusion matrices in Fig. 5 to show the classification performance of each category. The horizontal axis of the confusion matrix represents the true category and the vertical axis represents the predicted category, so the diagonal indicates the probability of correct classification. The proposed binocular data-driven method achieves high classification accuracy in every category.

Fig. 5. Confusion matrices for 6-class classification (left) and 54-class classification (right).

3.2 Comparison with MPIIGaze Dataset

In this section, we evaluate our classification method on the MPIIGaze dataset [18], which provides eye images and the corresponding gaze directions. To meet the needs of our experiment, we first convert the three-dimensional coordinates of the fixation points into angular coordinates, and then map these coordinates onto the corresponding blocks of a screen identical to the one used in our data collection. The eye images are labelled according to these mapped blocks. We select 33,000 left-eye samples and 33,000 right-eye samples from the MPIIGaze dataset. The estimation accuracies on both our dataset and the MPIIGaze dataset are shown in Table 1. In [18], the lowest error is about 6° and the mean error is about 10.3°. We have already described the angular extent of our blocks in Sect. 2.2; from Table 1, the accuracy of our gaze estimation mostly reaches about 6.38°, and in some cases about 2.13°. This demonstrates that our proposed method outperforms the method in [18] on their MPIIGaze dataset. The results may also indicate that datasets designed for regression methods are not well suited to block-based classification, and that it is necessary to build a dedicated dataset for button-based gaze interaction.
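
A hedged sketch of this direction-to-block conversion is shown below; the yaw/pitch convention and the assumed angular extent of the screen are illustrative choices, and the paper's own mapping may differ in detail.

```python
import numpy as np

def gaze_to_block(gaze_vec, n_rows=2, n_cols=3, half_span_deg=(12.75, 19.125)):
    """Map a 3D gaze direction onto a 2 x 3 block grid.

    half_span_deg is the assumed (vertical, horizontal) angular half-extent of
    the screen under the 1 cm ~ 1 degree approximation; the coordinate
    convention (camera looking along -z) is also an assumption of this sketch.
    """
    gaze_vec = np.asarray(gaze_vec, dtype=float)
    x, y, z = gaze_vec / np.linalg.norm(gaze_vec)
    yaw = np.degrees(np.arctan2(-x, -z))      # horizontal gaze angle
    pitch = np.degrees(np.arcsin(-y))         # vertical gaze angle
    col = int(np.clip((yaw + half_span_deg[1]) / (2 * half_span_deg[1]) * n_cols,
                      0, n_cols - 1))
    row = int(np.clip((pitch + half_span_deg[0]) / (2 * half_span_deg[0]) * n_rows,
                      0, n_rows - 1))
    return row * n_cols + col                 # block label in [0, n_rows * n_cols)
```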

Table 1. Comparison with the MPIIGaze dataset.

3.3 Cross-Subject Performance

A practical network should adapt well to new subjects, so we test the cross-subject performance of our trained CNN. We randomly select 4 subjects, marked n1–n4, out of the 22 subjects and test the classification accuracy of their samples with the trained CNN. The results for both 6-class and 54-class classification are shown in Table 2. There are significant differences between subjects, and the performance in 54-class classification is worse than in 6-class classification. The differences between subjects are likely related to the diversity of human eyes, while the general characteristics extracted by the CNN can only represent what individuals have in common. As our dataset contains few individuals (22), it is difficult to cover all user appearances, and the performance should improve when the training set covers more individuals. The lower accuracy for 54-class classification is likely because its block size is much smaller than that of 6-class classification. We also report the cross-subject comparison with the MPIIGaze dataset in Table 2: we randomly select 4 subjects, marked s1–s4, from the 15 subjects of MPIIGaze as the validation set and use the eye samples of the other 11 subjects as the training set for classification with our CNN model.
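
Cross-subject evaluation only requires that the split be made by subject identity rather than by individual sample; a minimal sketch with hypothetical record tuples:

```python
import random

def cross_subject_split(samples, held_out=4, seed=0):
    """Split a list of (image_path, label, subject_id) records by subject.

    Randomly holds out `held_out` subjects for validation; all samples from
    the remaining subjects form the training set.
    """
    subjects = sorted({s[2] for s in samples})
    random.Random(seed).shuffle(subjects)
    val_subjects = set(subjects[:held_out])
    train = [s for s in samples if s[2] not in val_subjects]
    val = [s for s in samples if s[2] in val_subjects]
    return train, val
```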

Table 2. Investigation on cross-subject gaze block estimation.

3.4 Comparison with Other Methods

We compare our binocular data-driven method with other appearance-based methods on our dataset. George and Routray [22] proposed a real-time eye gaze direction classification method using a CNN. They adopted a classification approach similar to ours, but trained two networks for the left and right eyes independently and obtained the final category by combining the two scores, achieving an 86.81% recognition rate for 7-class classification on the Eye Chimera dataset [23]. Our proposed method achieves a 9.39% accuracy improvement for 6-class classification and a 36.39% improvement for 54-class classification over the method in [22]. Zhang et al. [18] trained a regression network to estimate fixation points. To unify the comparison, we estimate the fixation point positions using their method and map them to our screen blocks; the mapped results are very poor, as their best mean error in [18] is 10.5° for cross-dataset evaluation. The comparison results in Table 3 show that our classification method achieves higher accuracy on both monocular and binocular data, and that binocular data effectively improves the classification performance.

Table 3. Comparison with other methods.

4 Conclusion

In this paper, we proposed a binocular-image-based gaze estimation method, which estimates the gaze block using CNN classification. Through the mapping among eye appearance, gaze direction, and screen blocks, we established a new classification-based gaze estimation paradigm. Different from previous gaze estimation methods, we achieve a twofold improvement, in both accuracy and stability, through block-based classification and the additional information provided by binocular images. In future work, we will continue to enrich and release our dataset with more subjects and more situations to improve cross-subject performance.