1 Introduction

Gaze-based interaction has been a highlight in the field of Human-Computer Interaction. As the fundamental technology behind gaze-based interaction, gaze estimation has attracted considerable research in recent years. Existing gaze estimation methods fall mainly into two categories, i.e., model-based methods [1,2,3,4,5,6,7,8] and appearance-based methods [10,11,12,13,14,15,16,17,18].

Most model-based gaze estimation methods rely on high-resolution cameras and infrared (IR) lights [5, 6]. These methods establish the eyeball geometry from the location of the pupil center and the reflections of the IR lights on the cornea, and estimate the gaze points according to the positional relationship between the IR lights and the cameras [7, 8]. Model-based methods can reach high accuracy, but most of them require a per-user calibration process before estimation, which harms the naturalness of HCI and degrades the quality of experience.

Different from the model-based methods, appearance-based gaze estimation methods are data-driven. They estimate the gaze point through a mapping between gaze points and the corresponding eye images or their features, learned from training eye images labelled with their gaze directions. Compared to the model-based methods, the appearance-based methods need no calibration and have low hardware requirements, which makes them promising in practical applications. Lu et al. [9, 10] divided an eye image into sub-regions and adopted intensity feature vectors as the extracted feature. Sugano et al. [11] proposed an appearance-based gaze estimation method using visual saliency. The most commonly used mapping functions include Gaussian Bayesian regression [12], adaptive clustering [13], random forest regression [14, 15], neural networks [16,17,18], and so on. Most of the above methods estimate the exact gaze points via a regression model. According to the performance reported in previous research, the accuracy of appearance-based methods in estimating the exact gaze point is much lower than that of model-based ones, even with the support of deep learning and large amounts of training data.

In practical interaction applications, the button-based touch interface is by far the most popular. Given this interaction pattern and the challenge of estimating the exact gaze point, we propose a novel appearance-based gaze estimation method. Our contributions in this paper are:

  1. We estimate gaze blocks rather than the exact positions of fixation points. During button-based touch interaction, the user usually tends to touch the center of the button, although a touch anywhere on the button triggers the operation. Similarly, the user tends to gaze at the center of a button when trying to trigger it by gaze. This means that all gazes falling into one button can be treated as gazes that trigger this button. Therefore, taking practical interaction applications into account, we broaden the gaze estimation task from point-wise estimation to block-wise estimation.

  2. We treat the mapping between eye images and gaze blocks as a classification task, implemented by a CNN-based classifier. Because all gazes falling into the same button can trigger it, these gazes can be considered one class in the case of button-based interaction, and the required accuracy can be relaxed to the size of the button. The classification-based estimation balances the requirements of practical applications against the difficulty of accurate estimation.

  3. We use binocular images as the training data, because binocular images provide more information about head pose and the relative position of the two eyes than monocular images, which is considered useful for accurate gaze estimation. This differs from most existing methods, which use left or right eye images independently, or process them separately and fuse the results at the end. In addition, different users have different dominant eyes, so the estimation accuracy on monocular images of the dominant eye is higher than on the non-dominant eye, while estimation based on binocular images can achieve accurate and robust results.

We perform block-wise gaze estimation at the 6- and 54-block levels on our collected data and the MPIIGaze dataset. The experiments show that the proposed method outperforms existing approaches in gaze block estimation.

2 Proposed Method

2.1 Framework

As mentioned above, this paper is dedicated to appearance-based gaze estimation via CNN. Figure 1 shows the basic framework of our binocular data-driven method. Like most appearance-based methods, the learning-based gaze estimation method first requires a large amount of data for model training. For this purpose, we establish our binocular dataset as detailed below. After data collection, we take the binocular images as the input of the CNN.

Fig. 1. Framework of the proposed gaze estimation method.

2.2 Binocular Image Collection

There are many publicly available human eye datasets. However, most of them provide only monocular data and are designed for regression problems. Unlike these datasets, we build our own dataset as described under Data Collection below, label all eye images with their corresponding block, and treat the eye images with the same label as the same category.

Data Collection.

Our experimental images are captured with a single web camera, a Logitech C270 with a resolution of 640 × 480, and a standard 19-inch screen with a 16:9 aspect ratio. The screen blocks are shown in Fig. 1. We first divide the screen into 2 × 3 blocks (thick lines), and each block is further divided into 3 × 3 small blocks (thin lines) according to the screen ratio. The subjects are asked to sit about 60 cm away from the screen with their heads free, which ensures that one centimeter on the screen corresponds approximately to one degree of gaze direction. The large blocks measure 12.75 × 12.75 cm, so the corresponding 6-class classification achieves an accuracy of ±6.38°; the small blocks measure 4.25 × 4.25 cm, so the corresponding 54-class classification achieves an accuracy of ±2.13°.
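
For concreteness, the relation between block size and angular accuracy can be checked with a short Python sketch; the 1 cm ≈ 1° small-angle approximation is the one used above, and the exact arctangent value is shown for comparison.

```python
import math

VIEW_DIST_CM = 60.0    # approximate viewing distance during data collection

def half_angle_deg(block_cm, dist_cm=VIEW_DIST_CM, small_angle=True):
    """Half the visual angle subtended by one block edge, in degrees.

    With the 1 cm ~ 1 degree approximation at 60 cm this is simply
    block_cm / 2; set small_angle=False for the exact arctangent value.
    """
    if small_angle:
        return block_cm / 2.0
    return math.degrees(math.atan(block_cm / 2.0 / dist_cm))

print(half_angle_deg(12.75))                      # 6.375  -> +/-6.38 deg (6-class)
print(half_angle_deg(4.25))                       # 2.125  -> +/-2.13 deg (54-class)
print(half_angle_deg(12.75, small_angle=False))   # ~6.06 deg, exact value
```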

We adopt a simple data collection procedure. Each subject is asked to gaze at the center of the blocks on the screen, and the block center is taken as the ground truth of the fixation points falling in that block. Gazes that fixate any point within a block are assigned to the same category in our method. In total, 56 groups of eye videos are collected from 22 subjects aged 20–30, and each video lasts about 4 min at 15 fps. To prevent fixation fatigue, we set a two-second rest between every two fixation points.

Binocular Images.

Figure 1 shows the acquisition of binocular data. We adopted the face and eye detection method in [19]. Given a fixation video of a subject, we first detected the face of the first frame by Haar-like features [20]. And then, we detected all remaining images by matching with the detected face from the first frame. For the obtained human face area, we used the empirical theory namely the general positions of the two eyes in the face area to obtain each eye’s center. The precise locations of the two eyes center can be further determined based on the human face. The binocular rectangle can be obtained by setting the threshold. All extracted binocular images contain both left and right eyes and are normalized to the fixed size of 40 × 184 according to the average size of the eye image samples. We randomly selected the 151200 training images and 30240 testing images to train the CNN model. The eye images in training set and testing set remain mutually disjoint.
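
As a rough illustration of this extraction step, the following sketch uses OpenCV's stock Haar cascade for face detection; the cascade file and the eye-band proportions are assumptions standing in for the detectors of [19, 20], not the exact pipeline used in the paper.

```python
import cv2

# Stock OpenCV Haar cascade; the original pipeline follows [19, 20] and may differ.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_binocular(frame, out_size=(184, 40)):
    """Detect the largest face and crop a fixed-size strip covering both eyes.

    The eye-band proportions (rows 0.20-0.55, cols 0.10-0.90 of the face box)
    are illustrative guesses standing in for the empirical priors in the paper.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    band = frame[y + int(0.20 * h): y + int(0.55 * h),
                 x + int(0.10 * w): x + int(0.90 * w)]
    return cv2.resize(band, out_size)                    # (width, height) = (184, 40)
```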

2.3 Gaze Estimation Using CNN

Although many appearance-based gaze estimation methods based on regression models have been proposed, it is difficult for them to achieve high accuracy, because estimating the exact gaze position is a genuinely hard task. Our block-wise, classification-based gaze estimation method relaxes the estimation target from a point to a block, which reduces the difficulty of training a learning-based model. Besides, estimating gaze blocks is exactly what button-based touch interaction requires.

Considering the good performance of the CNN model used in [21] for image classification in the ImageNet competition, we fine-tune the parameters of this network to suit our input images. Figure 2 illustrates the structure of our CNN model. The network contains three convolutional layers, each followed by a max pooling layer, and two fully connected layers that map the extracted features into a multidimensional vector. The last layer determines the final category of the input image by computing the probability of each class; the category with the highest probability is the predicted one. The cuboids in Fig. 2 represent the convolutional and max pooling layers; the length, width, and height of each cuboid denote the number of feature maps and the size of each map, and the size of the convolution kernel is illustrated by the small squares in each cuboid. The two fully connected layers are represented by rectangular bars of dimension 256, and the final layer represents the number of classes. In this paper, we use two classification criteria, namely 6-class and 54-class classification, which correspond to the two block sizes.

Fig. 2. The structure of the deep convolutional neural network.
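
A minimal PyTorch sketch of a network in the spirit of Fig. 2 is given below. The feature-map counts (40, 80, 128) and the 256-dimensional fully connected layers follow Figs. 2 and 3; the kernel sizes, padding, and the use of LazyLinear to infer the flattened feature size are assumptions, since the paper does not specify them all.

```python
import torch
import torch.nn as nn

class GazeBlockNet(nn.Module):
    """Three conv + pool stages and two 256-d fully connected layers (cf. Fig. 2).

    Channel counts (40, 80, 128) follow the pooling outputs shown in Fig. 3;
    kernel sizes and padding are assumptions made for this sketch.
    """
    def __init__(self, n_classes=54):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 40, kernel_size=5), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(40, 80, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(80, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):                 # x: (N, 3, 40, 184) binocular images
        return self.classifier(self.features(x))

logits = GazeBlockNet(n_classes=6)(torch.randn(2, 3, 40, 184))
print(logits.shape)                       # torch.Size([2, 6])
```

With a 3 × 40 × 184 input, the first and third pooling outputs of this sketch reproduce the 40 × 18 × 90 and 128 × 4 × 22 shapes reported in Fig. 3.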

2.4 CNN-Based Classification

We choose RGB images as the input, since the color information is useful for improving accuracy. We define \( x_i^{l-1} \) as the \( i \)th feature map of the \( (l-1) \)th layer and \( M \) as the number of feature maps in this layer. The output of the \( l \)th convolutional layer can then be expressed as:

$$ x_j^l = \mathrm{ReLU}\Big(\sum\limits_{i \in M} x_i^{l-1} * k_{ij}^l + b_j^l\Big) $$
(1)

where \( k_{ij}^l \) is the convolution kernel, \( b_j^l \) is the bias, and “\( * \)” denotes the convolution operator. The previous feature maps are convolved with different convolution kernels, shifted by a bias, and passed through the activation function; each result forms one of the feature maps of the convolutional layer.

The output of the current sub-sampling layer can be expressed as:

$$ x_j^l = f\big(\beta_j^l \cdot \mathrm{down}(x_j^{l-1}) + b_j^l\big) $$
(2)

where \( \mathrm{down}(\cdot) \) represents the max-pooling operation. The pooling result is multiplied by a gain coefficient \( \beta_j^l \) and shifted by a bias, followed by the activation function \( f \).
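
Equations (1) and (2) can be written directly with functional convolution and pooling operations; the tensor shapes and kernel size in the sketch below are illustrative only.

```python
import torch
import torch.nn.functional as F

# Eq. (1): convolve the previous feature maps, add a bias, apply ReLU.
x_prev = torch.randn(1, 40, 18, 90)        # M = 40 feature maps from layer l-1
kernels = torch.randn(80, 40, 3, 3)        # k_ij: one 3x3 kernel per (i, j) pair
bias = torch.randn(80)                     # b_j
x_conv = F.relu(F.conv2d(x_prev, kernels, bias, padding=1))

# Eq. (2): max pooling (down), scaled by a gain beta_j, shifted by a bias b_j,
# then passed through the activation f (ReLU here).
beta = torch.randn(80).view(1, -1, 1, 1)
b = torch.randn(80).view(1, -1, 1, 1)
x_pool = F.relu(beta * F.max_pool2d(x_conv, kernel_size=2) + b)
print(x_pool.shape)                        # torch.Size([1, 80, 9, 45])
```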

We use Rectified Linear Units (ReLU) as the activation function and add a Local Response Normalization (LRN) layer between the convolutional and pooling layers to improve performance. We also adopt a dropout layer to prevent over-fitting.

The convolutional layers, pooling layers, and activation functions map the original input into a hidden feature space. After feature extraction, we use the fully connected layers to perform the classification. The last two fully connected layers are implemented as inner products, and a 256-dimensional feature vector is generated in our network. In the final classification step, the probabilities of the different classes are computed, and the category with the highest probability is taken as the one the input image belongs to. When a test image passes through the trained network, the outputs of the three pooling layers are as shown in Fig. 3. These images indicate that the contours of the input image are gradually blurred and deeper features are gradually extracted as the network grows deeper.
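
The decision step described above reduces to a softmax over the class scores followed by an argmax; a minimal sketch (the model argument being any trained classifier such as the one sketched in Sect. 2.3):

```python
import torch
import torch.nn.functional as F

def predict_block(model, images):
    """Return the index of the most probable gaze block for each input image."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(images), dim=1)   # per-class probabilities
    return probs.argmax(dim=1)                    # predicted block label
```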

Fig. 3. Outputs of the different pooling layers in the CNN: the 1st pooling layer (40 × 18 × 90, top left), the 2nd pooling layer (80 × 10 × 46, top right), and the 3rd pooling layer (128 × 4 × 22, bottom).

3 Experiments and Evaluations

3.1 Gaze Estimation Performance

In this section, we present the results of the proposed binocular-image-based method. As mentioned above, we randomly split the data into training and testing sets with a 5:1 sample ratio, and the two sets are completely disjoint. We use binocular data, left eyes only, and right eyes only as the training data of the network, respectively, and all experiments are repeated in both the 6-class and 54-class scenarios. With binocular data, the proposed method reaches an average accuracy of 98.52% for 6-class classification and 90.97% for 54-class classification. With left eyes only, the accuracy reaches 93.31% for 6-class and 80.89% for 54-class classification; with right eyes only, 92.74% and 79.42%, respectively. The average classification accuracies are shown in Fig. 4, where the rising curves show how accuracy changes with iterations for the different types of input. It can be seen that binocular images improve accuracy by about 5% in the 6-block case and about 10% in the 54-block case, which means that binocular images are helpful for accurate gaze estimation.

Fig. 4. Experimental results of both 6-class and 54-class average classification accuracy.

We also give the confusion matrices in Fig. 5 to show the classification performance of each category. The horizontal axis of the confusion matrix represents the true category and the vertical axis represents the predicted category, so the diagonal indicates the probability of correct classification. The proposed binocular data-driven method achieves high classification accuracy in every category.

Fig. 5. Confusion matrices for 6-class classification (left) and 54-class classification (right).

3.2 Comparison with MPIIGaze Dataset

In this section, we evaluate our classification method on the MPIIGaze dataset [18], which provides eye images and the corresponding gaze directions. To meet the needs of our experiment, we first convert the three-dimensional coordinates of the fixation points into angular coordinates, and then map these coordinates onto the corresponding blocks of a screen identical to the one used in our data collection. The eye images are labelled according to these mapped blocks. We select 33,000 left-eye samples and 33,000 right-eye samples from the MPIIGaze dataset. The estimation accuracies on both our dataset and the MPIIGaze dataset are shown in Table 1. In [18], the lowest error is about 6° and the mean error is about 10.3°. We have already described the angular extent of our blocks in Sect. 2.2; from Table 1, the accuracy of our gaze estimation mostly reaches about 6.38°, and in some cases about 2.13°. This demonstrates that our proposed method outperforms the method in [18] on their MPIIGaze dataset. The results may also indicate that datasets designed for regression methods are not well suited to block-based classification, and that it is necessary to build a dedicated dataset for button-based gaze interaction.
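
A hedged sketch of this direction-to-block conversion is shown below; the yaw/pitch convention and the assumed angular extent of the screen are illustrative choices, and the paper's own mapping may differ in detail.

```python
import numpy as np

def gaze_to_block(gaze_vec, n_rows=2, n_cols=3, half_span_deg=(12.75, 19.125)):
    """Map a 3D gaze direction onto a 2 x 3 block grid.

    half_span_deg is the assumed (vertical, horizontal) angular half-extent of
    the screen under the 1 cm ~ 1 degree approximation; the coordinate
    convention (camera looking along -z) is also an assumption of this sketch.
    """
    gaze_vec = np.asarray(gaze_vec, dtype=float)
    x, y, z = gaze_vec / np.linalg.norm(gaze_vec)
    yaw = np.degrees(np.arctan2(-x, -z))      # horizontal gaze angle
    pitch = np.degrees(np.arcsin(-y))         # vertical gaze angle
    col = int(np.clip((yaw + half_span_deg[1]) / (2 * half_span_deg[1]) * n_cols,
                      0, n_cols - 1))
    row = int(np.clip((pitch + half_span_deg[0]) / (2 * half_span_deg[0]) * n_rows,
                      0, n_rows - 1))
    return row * n_cols + col                 # block label in [0, n_rows * n_cols)
```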

Table 1. Comparison with the MPIIGaze dataset.

3.3 Cross-Subject Performance

A practical network should adapt well to new subjects, so we test the cross-subject performance of our trained CNN. We randomly select 4 subjects, marked n1–n4, out of the 22 subjects and test the classification accuracy of their samples with the trained CNN. The results for both 6-class and 54-class classification are shown in Table 2. There are significant differences between subjects, and the performance in 54-class classification is worse than in 6-class classification. The differences between subjects are likely related to the diversity of human eyes, while the general characteristics extracted by the CNN can only represent what individuals have in common. As our dataset contains few individuals (22), it is difficult to cover all user appearances, and the performance should improve when the training set covers more individuals. The lower accuracy for 54-class classification is likely because its block size is much smaller than that of 6-class classification. We also report the cross-subject comparison with the MPIIGaze dataset in Table 2: we randomly select 4 subjects, marked s1–s4, from the 15 subjects of MPIIGaze as the validation set and use the eye samples of the other 11 subjects as the training set for classification with our CNN model.
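
Cross-subject evaluation only requires that the split be made by subject identity rather than by individual sample; a minimal sketch with hypothetical record tuples:

```python
import random

def cross_subject_split(samples, held_out=4, seed=0):
    """Split a list of (image_path, label, subject_id) records by subject.

    Randomly holds out `held_out` subjects for validation; all samples from
    the remaining subjects form the training set.
    """
    subjects = sorted({s[2] for s in samples})
    random.Random(seed).shuffle(subjects)
    val_subjects = set(subjects[:held_out])
    train = [s for s in samples if s[2] not in val_subjects]
    val = [s for s in samples if s[2] in val_subjects]
    return train, val
```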

Table 2. Investigation on cross-subject gaze block estimation.

3.4 Comparison with Other Methods

We compare our binocular data-driven method with other appearance-based methods on our dataset. George and Routray [22] proposed a real-time eye gaze direction classification method using a CNN. They adopted a classification approach similar to ours, but trained two networks for the left and right eyes independently and obtained the final category by combining the two scores, achieving an 86.81% recognition rate for 7-class classification on the Eye Chimera dataset [23]. Our proposed method achieves a 9.39% accuracy improvement for 6-class classification and a 36.39% improvement for 54-class classification over the method in [22]. Zhang et al. [18] trained a regression network to estimate fixation points. To unify the comparison, we estimate the fixation point positions using their method and map them to our screen blocks; the mapped results are very poor, as their best mean error in [18] is 10.5° for cross-dataset evaluation. The comparison results in Table 3 show that our classification method achieves higher accuracy on both monocular and binocular data, and that binocular data effectively improves the classification performance.

Table 3. Comparison with other methods.

4 Conclusion

In this paper, we proposed a binocular-image-based gaze estimation method, which estimates the gaze block using CNN classification. Through the mapping among eye appearance, gaze direction, and screen blocks, we established a new classification-based gaze estimation paradigm. Different from previous gaze estimation methods, we achieve a twofold improvement, in both accuracy and stability, through block-based classification and the additional information provided by binocular images. In future work, we will continue to enrich and release our dataset with more subjects and more situations to improve cross-subject performance.