Look Both Ways: Self-supervising Driver Gaze Estimation and Road Scene Saliency

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

We present a new on-road driving dataset, called “Look Both Ways”, which contains synchronized video of both driver faces and the forward road scene, along with ground truth gaze data registered from eye tracking glasses worn by the drivers. Our dataset supports the study of methods for non-intrusively estimating a driver’s focus of attention while driving, an important application area in road safety. A key challenge is that this task requires accurate gaze estimation, but supervised appearance-based gaze estimation methods often do not transfer well to real driving datasets, and in-domain ground truth to supervise them is difficult to gather. We therefore propose a method for self-supervision of driver gaze, which takes advantage of the geometric consistency between the driver’s gaze direction and the saliency of the scene as observed by the driver. We formulate a 3D geometric learning framework to enforce this consistency, allowing the gaze model to supervise the scene saliency model, and vice versa. We implement a prototype of our method and test it with our dataset, showing that, compared to a supervised approach, it can yield better gaze estimation and scene saliency estimation with no additional labels.
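To make the consistency idea concrete, the sketch below (not the authors' implementation; all tensor names, shapes, and helper functions are illustrative assumptions) shows one way the geometric link could be expressed in PyTorch: a 3D gaze ray estimated from the driver-facing camera is intersected with the scene at an assumed fixation depth, projected into the forward-facing camera, and scored against the predicted saliency map, so that gradients can flow to both the gaze and saliency models.

```python
# Illustrative sketch of a gaze/saliency geometric consistency term.
# Not the paper's code; shapes, names, and the fixed fixation depth are assumptions.

import torch
import torch.nn.functional as F

def project_gaze_to_scene(eye_pos, gaze_dir, depth, K_scene, T_face_to_scene):
    """Intersect the gaze ray with the scene and project it into the scene image.

    eye_pos:          (B, 3) 3D eye position in the face-camera frame
    gaze_dir:         (B, 3) unit gaze direction in the face-camera frame
    depth:            (B,)   assumed distance to the fixated point along the ray
    K_scene:          (3, 3) scene-camera intrinsics
    T_face_to_scene:  (4, 4) rigid transform from face-camera to scene-camera frame
    returns:          (B, 2) pixel coordinates of the fixation in the scene image
    """
    point_face = eye_pos + depth.unsqueeze(-1) * gaze_dir             # fixated 3D point
    point_h = F.pad(point_face, (0, 1), value=1.0)                    # homogeneous coordinates
    point_scene = (T_face_to_scene @ point_h.unsqueeze(-1)).squeeze(-1)[:, :3]
    pix = (K_scene @ point_scene.unsqueeze(-1)).squeeze(-1)           # perspective projection
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)

def consistency_loss(saliency_logits, fixation_px, image_size):
    """Cross-entropy between the predicted saliency map and the projected fixation."""
    B, H, W = saliency_logits.shape
    log_sal = F.log_softmax(saliency_logits.view(B, -1), dim=-1).view(B, H, W)
    # Scale pixel coordinates to the saliency-map resolution and look up log-probabilities.
    u = (fixation_px[:, 0] * W / image_size[1]).long().clamp(0, W - 1)
    v = (fixation_px[:, 1] * H / image_size[0]).long().clamp(0, H - 1)
    idx = torch.arange(B, device=log_sal.device)
    return -log_sal[idx, v, u].mean()
```

Treating the saliency map as a categorical distribution over scene pixels lets the projected fixation supervise the saliency network, while a low saliency value at the projected pixel in turn penalizes the gaze estimate; this mutual supervision is the behaviour the abstract describes, realized here under the stated simplifying assumptions.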


Notes

  1. We use a von Mises-Fisher density function where \(\kappa\) is equivalent to the standard deviation of a Gaussian density function.
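For completeness, the standard von Mises-Fisher density on the unit sphere in three dimensions (the case relevant to 3D gaze directions) is reproduced below; this is textbook material rather than an equation taken from the paper. The concentration parameter \(\kappa\) controls the spread about the mean direction \(\boldsymbol{\mu}\), playing a role analogous to that of the Gaussian standard deviation, with larger \(\kappa\) corresponding to a tighter distribution.

\[
f(\mathbf{x}; \boldsymbol{\mu}, \kappa) \;=\; \frac{\kappa}{4\pi \sinh \kappa}\, \exp\!\left(\kappa\, \boldsymbol{\mu}^{\top} \mathbf{x}\right), \qquad \lVert \mathbf{x} \rVert = \lVert \boldsymbol{\mu} \rVert = 1.
\]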


Acknowledgement

This research is based on work supported by Toyota Research Institute and the NSF under IIS #1846031. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors.

Author information


Correspondence to Isaac Kasahara.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 971 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kasahara, I., Stent, S., Park, H.S. (2022). Look Both Ways: Self-supervising Driver Gaze Estimation and Road Scene Saliency. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_8


  • DOI: https://doi.org/10.1007/978-3-031-19778-9_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

  • eBook Packages: Computer Science, Computer Science (R0)
