Vicsgaze: a gaze estimation method using self-supervised contrastive learning

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Existing deep learning-based gaze estimation methods achieve high accuracy, but their performance relies on large-scale datasets with gaze labels, and collecting such datasets is time-consuming and expensive. To this end, we propose VicsGaze, a self-supervised network that learns generalized gaze-aware representations without labeled data. We feed two gaze-specific augmented views of the same face image into a multi-branch convolutional re-parameterization encoder to obtain feature representations. Although the two augmented views give the original face image different appearances, the gaze direction they represent is consistent. We then map these two representations into an embedding space and employ a novel loss function to optimize model training. Experiments demonstrate that VicsGaze achieves outstanding cross-dataset gaze estimation performance on several datasets, and it outperforms supervised learning baselines when fine-tuned with only a few calibration samples.
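To make the training pipeline above concrete, the following is a minimal PyTorch sketch of a two-branch self-supervised step of the kind the abstract describes. It is an illustration only, not the authors' released implementation: the small encoder stands in for the multi-branch convolutional re-parameterization backbone, the loss is a generic VICReg-style variance-invariance-covariance objective assumed here as a placeholder for the paper's loss, and every name, dimension, and weight (TwoViewGazeSSL, vic_style_loss, feat_dim, embed_dim) is hypothetical.

# Hedged sketch: a two-view self-supervised step with a VICReg-style loss.
# Not the paper's code; encoder, loss weights, and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def vic_style_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    # Invariance: both augmented views of a face keep the same gaze direction,
    # so their embeddings should agree.
    inv = F.mse_loss(z1, z2)

    # Variance: keep the std of every embedding dimension above 1 to avoid
    # representational collapse.
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()

    # Covariance: penalize off-diagonal covariance to decorrelate dimensions.
    n, d = z1.shape
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off = lambda m: m - torch.diag_embed(torch.diagonal(m))
    cov = off(cov1).pow(2).sum() / d + off(cov2).pow(2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov


class TwoViewGazeSSL(nn.Module):
    def __init__(self, feat_dim=128, embed_dim=256):
        super().__init__()
        # Placeholder encoder; the paper uses a multi-branch re-parameterizable
        # CNN backbone instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Projector that maps features into the embedding space where the
        # loss is applied.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.BatchNorm1d(embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, view1, view2):
        z1 = self.projector(self.encoder(view1))
        z2 = self.projector(self.encoder(view2))
        return vic_style_loss(z1, z2)


if __name__ == "__main__":
    model = TwoViewGazeSSL()
    # Two gaze-preserving augmentations of the same batch of face crops.
    v1, v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
    loss = model(v1, v2)
    loss.backward()
    print(float(loss))

The point the sketch illustrates is that both branches share one encoder and projector, so the objective only needs to pull gaze-preserving views together while keeping the embedding dimensions high-variance and decorrelated to prevent collapse.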


Data availability statement

The data and code are available at https://github.com/lmh10233/VicsGaze.


Acknowledgements

This work was supported by the Natural Science Foundation of Jiangsu Province, China (Grant Nos. BK20180594 and BK20231036).

Author information

Contributions

De Gu: Funding acquisition, Methodology, Writing—Reviewing and Editing; Minghao Lv: Methodology, Software, Writing—original draft; Jianchu Liu: Investigation, Methodology, Writing—Reviewing and Editing.

Corresponding author

Correspondence to De Gu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Haojie Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gu, D., Lv, M. & Liu, J. Vicsgaze: a gaze estimation method using self-supervised contrastive learning. Multimedia Systems 30, 330 (2024). https://doi.org/10.1007/s00530-024-01458-x


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-024-01458-x

Keywords