Vicsgaze: a gaze estimation method using self-supervised contrastive learning

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Existing deep learning-based gaze estimation methods achieve high accuracy, but their performance relies on large-scale datasets with gaze labels, and collecting such datasets is time-consuming and expensive. To this end, we propose VicsGaze, a self-supervised network that learns generalized gaze-aware representations without labeled data. We feed two gaze-specific augmented views of the same face image into a multi-branch convolutional re-parameterization encoder to obtain feature representations. Although the two augmented views give the original face image different appearances, the gaze direction they represent is consistent. We then map these two representations into an embedding space and employ a novel loss function to optimize model training. Experiments demonstrate that VicsGaze achieves outstanding cross-dataset gaze estimation performance on several datasets, and it outperforms supervised learning baselines when fine-tuned with only a few calibration samples.
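To make the training pipeline above concrete, the following is a minimal PyTorch sketch of a two-branch self-supervised step of the kind the abstract describes. It is an illustration only, not the authors' released implementation: the small encoder stands in for the multi-branch convolutional re-parameterization backbone, the loss is a generic VICReg-style variance-invariance-covariance objective assumed here as a placeholder for the paper's loss, and every name, dimension, and weight (TwoViewGazeSSL, vic_style_loss, feat_dim, embed_dim) is hypothetical.

# Hedged sketch: a two-view self-supervised step with a VICReg-style loss.
# Not the paper's code; encoder, loss weights, and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def vic_style_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    # Invariance: both augmented views of a face keep the same gaze direction,
    # so their embeddings should agree.
    inv = F.mse_loss(z1, z2)

    # Variance: keep the std of every embedding dimension above 1 to avoid
    # representational collapse.
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()

    # Covariance: penalize off-diagonal covariance to decorrelate dimensions.
    n, d = z1.shape
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off = lambda m: m - torch.diag_embed(torch.diagonal(m))
    cov = off(cov1).pow(2).sum() / d + off(cov2).pow(2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov


class TwoViewGazeSSL(nn.Module):
    def __init__(self, feat_dim=128, embed_dim=256):
        super().__init__()
        # Placeholder encoder; the paper uses a multi-branch re-parameterizable
        # CNN backbone instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Projector that maps features into the embedding space where the
        # loss is applied.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.BatchNorm1d(embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, view1, view2):
        z1 = self.projector(self.encoder(view1))
        z2 = self.projector(self.encoder(view2))
        return vic_style_loss(z1, z2)


if __name__ == "__main__":
    model = TwoViewGazeSSL()
    # Two gaze-preserving augmentations of the same batch of face crops.
    v1, v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
    loss = model(v1, v2)
    loss.backward()
    print(float(loss))

The point the sketch illustrates is that both branches share one encoder and projector, so the objective only needs to pull gaze-preserving views together while keeping the embedding dimensions high-variance and decorrelated to prevent collapse.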


Data availability statement

The data and code are available at https://github.com/lmh10233/VicsGaze.


Acknowledgements

This work was supported by the Natural Science Foundation of Jiangsu Province, China (Grant Nos. BK20180594 and BK20231036).

Author information

Contributions

De Gu: Funding acquisition, Methodology, Writing—Reviewing and Editing; Minghao Lv: Methodology, Software, Writing—original draft; Jianchu Liu: Investigation, Methodology, Writing—Reviewing and Editing.

Corresponding author

Correspondence to De Gu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Haojie Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gu, D., Lv, M. & Liu, J. Vicsgaze: a gaze estimation method using self-supervised contrastive learning. Multimedia Systems 30, 330 (2024). https://doi.org/10.1007/s00530-024-01458-x


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-024-01458-x

Keywords