Abstract
We present a new on-road driving dataset, called “Look Both Ways”, which contains synchronized video of both driver faces and the forward road scene, along with ground-truth gaze data registered from eye-tracking glasses worn by the drivers. Our dataset supports the study of methods for non-intrusively estimating a driver’s focus of attention while driving, an important application area in road safety. A key challenge is that this task requires accurate gaze estimation, but supervised appearance-based gaze estimation methods often transfer poorly to real driving datasets, and in-domain ground truth to supervise them is difficult to gather. We therefore propose a method for self-supervising driver gaze estimation that exploits the geometric consistency between the driver’s gaze direction and the saliency of the scene as observed by the driver. We formulate a 3D geometric learning framework to enforce this consistency, allowing the gaze model to supervise the scene saliency model, and vice versa. We implement a prototype of our method and test it on our dataset, showing that, compared to a supervised approach, it yields better gaze and scene saliency estimation with no additional labels.
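To make the geometric-consistency idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a predicted 3D gaze direction is projected through an assumed pinhole scene camera into the road-scene image, spread into a soft heatmap (a simple exponential kernel standing in for a von Mises-Fisher-style spreading), and compared against the saliency network's prediction with a KL divergence. All names, tensor shapes, intrinsics, and the choice of KL divergence are illustrative assumptions.

```python
# Illustrative sketch only: a gaze/saliency consistency loss in PyTorch.
# Assumptions (not from the paper): pinhole scene camera with intrinsics K,
# a unit 3D gaze direction expressed in the scene-camera frame, and KL
# divergence as the consistency measure.
import torch
import torch.nn.functional as F

def gaze_to_heatmap(gaze_dir, K, height, width, kappa=50.0):
    """Project a unit gaze direction into the scene image and spread it
    into a soft heatmap; larger kappa gives a more peaked distribution."""
    p = K @ gaze_dir                      # ray through camera center -> (3,)
    u, v = p[0] / p[2], p[1] / p[2]       # pixel coordinates of gaze point
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    d2 = (xs - u) ** 2 + (ys - v) ** 2    # squared pixel distance to gaze
    heatmap = torch.exp(-kappa * d2 / (height * width))
    return heatmap / heatmap.sum()        # normalize to a distribution

def consistency_loss(gaze_dir, saliency_logits, K):
    """KL divergence between the gaze-induced heatmap and the saliency
    network's predicted distribution over the scene image."""
    h, w = saliency_logits.shape
    gaze_map = gaze_to_heatmap(gaze_dir, K, h, w)
    log_saliency = F.log_softmax(saliency_logits.view(-1), dim=0)
    return F.kl_div(log_saliency, gaze_map.view(-1), reduction="sum")

# Toy usage with random inputs and made-up intrinsics.
K = torch.tensor([[500.0, 0.0, 160.0],
                  [0.0, 500.0, 120.0],
                  [0.0, 0.0, 1.0]])
gaze = F.normalize(torch.tensor([0.1, -0.05, 1.0]), dim=0)
logits = torch.randn(240, 320)
print(consistency_loss(gaze, logits, K))
```

In a full system, gradients from such a loss would flow into both the gaze network and the saliency network, which is the sense in which each model can supervise the other without additional labels.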
Notes
1. We use a von Mises-Fisher density function; its concentration parameter \(\kappa\) plays a role analogous to the inverse variance \(1/\sigma^2\) of a Gaussian density function.
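For reference (the standard definition, not taken from the paper), the von Mises-Fisher density over unit vectors \(\mathbf{x} \in S^2\) with unit mean direction \(\boldsymbol{\mu}\) and concentration \(\kappa \ge 0\) is:

```latex
% Standard von Mises-Fisher density on the unit sphere S^2.
f(\mathbf{x}; \boldsymbol{\mu}, \kappa)
  = \frac{\kappa}{4\pi \sinh \kappa}
    \exp\!\bigl(\kappa \, \boldsymbol{\mu}^{\top} \mathbf{x}\bigr)
% For large kappa this approaches an isotropic Gaussian on the tangent
% plane at mu with variance 1/kappa, which is why kappa behaves like an
% inverse variance rather than a standard deviation.
```

Larger \(\kappa\) concentrates the density around \(\boldsymbol{\mu}\), so \(\kappa\) controls the spread of the modeled gaze distribution.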
Acknowledgement
This research is based on work supported by Toyota Research Institute and the NSF under IIS #1846031. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kasahara, I., Stent, S., Park, H.S. (2022). Look Both Ways: Self-supervising Driver Gaze Estimation and Road Scene Saliency. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_8
DOI: https://doi.org/10.1007/978-3-031-19778-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19777-2
Online ISBN: 978-3-031-19778-9
eBook Packages: Computer Science, Computer Science (R0)