Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15067)

Abstract

Egocentric gaze anticipation serves as a key building block for emerging Augmented Reality capabilities. Notably, gaze behavior during daily activities is driven by both visual cues and audio signals. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in the spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design. We demonstrate that audio improves performance by +2.5% and +2.4% on the two datasets, and that our model outperforms prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations of the gaze anticipation results and share additional insights into audio-visual representation learning. The code and data split are available on our website (https://bolinlai.github.io/CSTS-EgoGazeAnticipation/).
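The fusion idea described in the abstract can be pictured with a brief, hypothetical sketch: two cross-attention modules capture audio-visual correlations separately along the spatial and temporal dimensions, and a symmetric contrastive loss aligns the pooled, re-weighted audio and visual features. The module layout, tensor shapes, and pooling below are illustrative assumptions, not the authors' released implementation (see the project website for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSTSFusionSketch(nn.Module):
    """Illustrative sketch of spatial-temporal separable audio-visual fusion
    with a contrastive objective. Shapes, module choices, and pooling are
    assumptions for exposition, not the paper's actual architecture."""

    def __init__(self, dim=256, heads=4, temperature=0.07):
        super().__init__()
        # Spatial branch: video patch tokens attend to a global audio token.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal branch: video frame tokens attend to audio frame tokens.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temperature = temperature

    def forward(self, vid_spatial, vid_temporal, aud_temporal):
        # vid_spatial:  (B, N, D) video features pooled over time (one token per patch)
        # vid_temporal: (B, T, D) video features pooled over space (one token per frame)
        # aud_temporal: (B, T, D) audio features per time step
        aud_global = aud_temporal.mean(dim=1, keepdim=True)                  # (B, 1, D)
        spatial_fused, _ = self.spatial_attn(vid_spatial, aud_global, aud_global)
        temporal_fused, _ = self.temporal_attn(vid_temporal, aud_temporal, aud_temporal)
        return spatial_fused, temporal_fused

    def contrastive_loss(self, vid_fused, aud_feat):
        # Symmetric InfoNCE over pooled video and audio features, matching each
        # clip to its own audio track within the batch.
        v = F.normalize(vid_fused.mean(dim=1), dim=-1)    # (B, D)
        a = F.normalize(aud_feat.mean(dim=1), dim=-1)     # (B, D)
        logits = v @ a.t() / self.temperature             # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


# Example usage with random tensors (batch of 2 clips, 8 frames, 196 patches).
if __name__ == "__main__":
    model = CSTSFusionSketch()
    vid_s = torch.randn(2, 196, 256)
    vid_t = torch.randn(2, 8, 256)
    aud_t = torch.randn(2, 8, 256)
    spatial, temporal = model(vid_s, vid_t, aud_t)
    loss = model.contrastive_loss(temporal, aud_t)
```

The point of keeping separate spatial and temporal branches, rather than a single full space-time cross-attention, is what "separable" refers to in the abstract; the details of how the paper re-weights and combines the fused features differ from this sketch.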

Notes

  1. We only use the subset collected in social scenarios [31, 32].

References

  1. Agrawal, R., Jyoti, S., Girmaji, R., Sivaprasad, S., Gandhi, V.: Does audio help in deep audio-visual saliency prediction models? In: Proceedings of the 2022 International Conference on Multimodal Interaction, pp. 48–56 (2022)

  2. Akbari, H., et al.: VATT: transformers for multimodal self-supervised learning from raw video, audio and text. In: Advances in Neural Information Processing Systems, vol. 34, pp. 24206–24221 (2021)

  3. Alayrac, J.B., et al.: Self-supervised multimodal versatile networks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 25–37 (2020)

  4. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617 (2017)

  5. Arandjelovic, R., Zisserman, A.: Objects that sound. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–451 (2018)

  6. Chang, Q., Zhu, S.: Temporal-spatial feature pyramid for video saliency detection. Cogn. Comput. (2021)

  7. Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., Zisserman, A.: Localizing visual sounds the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16867–16876 (2021)

  8. Cheng, S., Gao, X., Song, L., Xiahou, J.: Audio-visual saliency network with audio attention module. In: 2021 2nd International Conference on Artificial Intelligence and Information Systems, pp. 1–5 (2021)

  9. Coutrot, A., Guyader, N.: Multimodal saliency models for videos. In: From Human Attention to Computational Attention: A Multidisciplinary Approach, pp. 291–304 (2016)

  10. Damen, D., et al.: Scaling egocentric vision: the epic-kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736 (2018)

  11. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  12. Fan, H., et al.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835 (2021)

  13. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)

  14. Gao, R., Oh, T.H., Grauman, K., Torresani, L.: Listen to look: action recognition by previewing audio. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10457–10467 (2020)

  15. Gong, Y., et al.: Contrastive audio-visual masked autoencoder. In: International Conference on Learning Representations (2022)

  16. Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012 (2022)

  17. Gurram, S., Fang, A., Chan, D., Canny, J.: Lava: language audio vision alignment for contrastive video pre-training. arXiv preprint arXiv:2207.08024 (2022)

  18. Hayhoe, M., Ballard, D.: Eye movements in natural behavior. Trends Cogn. Sci. 9(4), 188–194 (2005)

  19. Hu, D., Nie, F., Li, X.: Deep multimodal clustering for unsupervised audiovisual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9248–9257 (2019)

  20. Hu, D., et al.: Discriminative sounding objects localization via self-supervised audiovisual matching. In: Advances in Neural Information Processing Systems, vol. 33, pp. 10077–10087 (2020)

  21. Hu, X., Chen, Z., Owens, A.: Mix and localize: localizing sound sources in mixtures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10483–10492 (2022)

  22. Huang, C., Tian, Y., Kumar, A., Xu, C.: Egocentric audio-visual object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22910–22921 (2023)

  23. Huang, Y., Cai, M., Li, Z., Lu, F., Sato, Y.: Mutual context network for jointly estimating egocentric gaze and action. IEEE Trans. Image Process. 29, 7795–7806 (2020)

  24. Huang, Y., Cai, M., Li, Z., Sato, Y.: Predicting gaze in egocentric video by learning task-dependent attention transition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 754–769 (2018)

  25. Huang, Y., Cai, M., Sato, Y.: An ego-vision system for discovering human joint attention. IEEE Trans. Hum.-Mach. Syst. 50(4), 306–316 (2020)

  26. Jain, S., Yarlagadda, P., Jyoti, S., Karthik, S., Subramanian, R., Gandhi, V.: Vinet: pushing the limits of visual modality for audio-visual saliency prediction. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3520–3527. IEEE (2021)

  27. Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: audio-visual temporal binding for egocentric action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5492–5501 (2019)

  28. Korbar, B., Tran, D., Torresani, L.: Cooperative learning of audio and video models from self-supervised synchronization. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

  29. Lai, B., Liu, M., Ryan, F., Rehg, J.: In the eye of transformer: global-local correlation for egocentric gaze estimation. In: British Machine Vision Conference (2022)

  30. Lai, B., Liu, M., Ryan, F., Rehg, J.M.: In the eye of transformer: global-local correlation for egocentric gaze estimation and beyond. Int. J. Comput. Vision 132(3), 854–871 (2024)

  31. Lai, B., et al.: Werewolf among us: multimodal resources for modeling persuasion behaviors in social deduction games. In: Association for Computational Linguistics: ACL 2023 (2023)

  32. Lee, S., Lai, B., Ryan, F., Boote, B., Rehg, J.M.: Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14585–14595 (2024)

  33. Li, Y., Fathi, A., Rehg, J.M.: Learning to predict gaze in egocentric video. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3216–3223 (2013)

  34. Li, Y., Liu, M., Rehg, J.: In the eye of the beholder: gaze and actions in first person video. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 6731–6747 (2021)

  35. Lin, K.Q., et al.: Egocentric video-language pretraining. In: Advances in Neural Information Processing Systems (2022)

  36. Lin, Y.B., Sung, Y.L., Lei, J., Bansal, M., Bertasius, G.: Vision transformers are parameter-efficient audio-visual learners. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)

  37. Lv, Z., et al.: Aria pilot dataset (2022). https://about.facebook.com/realitylabs/projectaria/datasets

  38. Ma, S., Zeng, Z., McDuff, D., Song, Y.: Active contrastive learning of audio-visual video representations. In: International Conference on Learning Representations (2020)

  39. Ma, S., Zeng, Z., McDuff, D., Song, Y.: Contrastive learning of global-local video representations. arXiv preprint arXiv:2104.05418 (2021)

  40. Min, X., Zhai, G., Gu, K., Yang, X.: Fixation prediction through multimodal analysis. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 13(1), 1–23 (2016)

  41. Min, X., Zhai, G., Zhou, J., Zhang, X.P., Yang, X., Guan, X.: A multimodal saliency model for videos with high audio-visual correspondence. IEEE Trans. Image Process. 29, 3805–3819 (2020)

  42. Morgado, P., Li, Y., Vasconcelos, N.: Learning representations from audio-visual spatial alignment. In: Advances in Neural Information Processing Systems, vol. 33, pp. 4733–4744 (2020)

  43. Morgado, P., Misra, I., Vasconcelos, N.: Robust audio-visual instance discrimination. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12934–12945 (2021)

  44. Morgado, P., Vasconcelos, N., Misra, I.: Audio-visual instance discrimination with cross-modal agreement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12475–12486 (2021)

  45. Patrick, M., et al.: Multi-modal self-supervision from generalized data transformations. arXiv preprint (2020)

  46. Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., Lin, W.: Multiple sound sources localization from coarse to fine. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 292–308. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_18

  47. Ratajczak, R., Pellerin, D., Labourey, Q., Garbay, C.: A fast audiovisual attention model for human detection and localization on a companion robot. In: VISUAL 2016-The First International Conference on Applications and Systems of Visual Paradigms (VISUAL 2016) (2016)

  48. Ruesch, J., Lopes, M., Bernardino, A., Hornstein, J., Santos-Victor, J., Pfeifer, R.: Multimodal saliency-based bottom-up attention a framework for the humanoid robot icub. In: 2008 IEEE International Conference on Robotics and Automation, pp. 962–967. IEEE (2008)

  49. Ryan, F., Jiang, H., Shukla, A., Rehg, J.M., Ithapu, V.K.: Egocentric auditory attention localization in conversations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14663–14674 (2023)

  50. Schaefer, K., Süss, K., Fiebig, E.: Acoustic-induced eye movements. Ann. N. Y. Acad. Sci. 374, 674–688 (1981)

  51. Schauerte, B., Kühn, B., Kroschel, K., Stiefelhagen, R.: Multimodal saliency-based attention for object-based scene analysis. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1173–1179. IEEE (2011)

  52. Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4358–4366 (2018)

  53. Sidaty, N., Larabi, M.C., Saadane, A.: Toward an audiovisual attention model for multimodal video content. Neurocomputing 259, 94–111 (2017)

  54. Sigurdsson, G.A., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Charades-ego: a large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626 (2018)

  55. Soo Park, H., Shi, J.: Social saliency prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4777–4785 (2015)

  56. Tavakoli, H.R., Borji, A., Rahtu, E., Kannala, J.: Dave: a deep audio-visual embedding for dynamic saliency prediction. arXiv preprint arXiv:1905.10693 (2019)

  57. Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  58. Tsiami, A., Koutras, P., Katsamanis, A., Vatakis, A., Maragos, P.: A behaviorally inspired fusion approach for computational audiovisual saliency modeling. Signal Process. Image Commun. 76, 186–200 (2019)

  59. Tsiami, A., Koutras, P., Maragos, P.: Stavis: spatio-temporal audiovisual saliency network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4766–4776 (2020)

  60. Wang, G., Chen, C., Fan, D.P., Hao, A., Qin, H.: From semantic categories to fixations: a novel weakly-supervised visual-auditory saliency detection approach. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15119–15128 (2021)

  61. Wang, G., Chen, C., Fan, D.P., Hao, A., Qin, H.: Weakly supervised visual-auditory fixation prediction with multigranularity perception. arXiv preprint arXiv:2112.13697 (2021)

  62. Wang, W., Tran, D., Feiszli, M.: What makes training multi-modal classification networks hard? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12695–12705 (2020)

  63. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)

  64. Xiong, J., Wang, G., Zhang, P., Huang, W., Zha, Y., Zhai, G.: Casp-net: rethinking video saliency prediction from an audio-visual consistency perceptual perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6441–6450 (2023)

  65. Yang, Q., et al.: SVGC-AVA: 360-degree video saliency prediction with spherical vector-based graph convolution and audio-visual attention. IEEE Trans. Multimedia (2023)

  66. Yao, S., Min, X., Zhai, G.: Deep audio-visual fusion neural network for saliency estimation. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 1604–1608. IEEE (2021)

  67. Zhang, M., Ma, K.T., Lim, J.H., Zhao, Q., Feng, J.: Anticipating where people will look using adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1783–1796 (2018)

  68. Zhang, M., Teck Ma, K., Hwee Lim, J., Zhao, Q., Feng, J.: Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4372–4381 (2017)

Acknowledgements

Portions of this work were supported in part by a gift from Meta and a grant from the Toyota Research Institute University 2.0 program. The second author is supported by an NSF Graduate Research Fellowship.

Author information

Correspondence to Miao Liu or James M. Rehg.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lai, B., Ryan, F., Jia, W., Liu, M., Rehg, J.M. (2025). Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_11

  • DOI: https://doi.org/10.1007/978-3-031-72673-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72672-9

  • Online ISBN: 978-3-031-72673-6

  • eBook Packages: Computer Science, Computer Science (R0)
