Abstract
Egocentric gaze anticipation is a key building block for emerging Augmented Reality capabilities. Notably, gaze behavior during daily activities is driven by both visual cues and audio signals. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that uses two modules to separately capture audio-visual correlations along the spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and a thorough analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design. Audio improves performance by +2.5% and +2.4% on the two datasets, and our model outperforms prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations of the gaze anticipation results and share additional insights into audio-visual representation learning. The code and data split are available on our website (https://bolinlai.github.io/CSTS-EgoGazeAnticipation/).
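To make the fusion design concrete, the following is a minimal PyTorch sketch of spatial-temporal separable audio-visual fusion with a contrastive objective, in the spirit of the CSTS approach described above. The module names, tensor layout, token dimensions, and the exact placement of the cross-attention blocks are illustrative assumptions on our part, not the authors' implementation; the gaze-anticipation decoder is omitted, and the actual code is available at the project website above.

```python
# Minimal sketch of spatial-temporal separable audio-visual fusion with a
# contrastive loss. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableAVFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Two independent cross-attention blocks: one correlates audio with
        # visual tokens along the spatial axis, the other along time.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, T, S, D) visual tokens (time x space); aud: (B, T, D).
        B, T, S, D = vis.shape

        # Spatial fusion: per frame, the audio token queries the spatial
        # visual tokens, yielding a spatial re-weighting of visual features.
        v_sp = vis.reshape(B * T, S, D)
        a_sp = aud.reshape(B * T, 1, D)
        sp_out, _ = self.spatial_attn(a_sp, v_sp, v_sp)      # (B*T, 1, D)
        vis_spatial = vis + sp_out.reshape(B, T, 1, D)       # broadcast over S

        # Temporal fusion: spatially pooled visual features attend to the
        # audio sequence across time.
        v_tm = vis.mean(dim=2)                               # (B, T, D)
        tm_out, _ = self.temporal_attn(v_tm, aud, aud)       # (B, T, D)
        aud_temporal = aud + tm_out

        return vis_spatial, aud_temporal


def av_contrastive_loss(vis_feat, aud_feat, temperature: float = 0.07):
    """Symmetric InfoNCE loss pairing each clip's re-weighted visual and
    audio features; other clips in the batch serve as negatives."""
    v = F.normalize(vis_feat.mean(dim=(1, 2)), dim=-1)   # (B, D)
    a = F.normalize(aud_feat.mean(dim=1), dim=-1)        # (B, D)
    logits = v @ a.t() / temperature                     # (B, B)
    target = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))


if __name__ == "__main__":
    fusion = SeparableAVFusion(dim=256)
    vis = torch.randn(2, 8, 49, 256)   # 2 clips, 8 frames, 7x7 spatial tokens
    aud = torch.randn(2, 8, 256)       # one audio token per frame
    v_out, a_out = fusion(vis, aud)
    print(av_contrastive_loss(v_out, a_out).item())
```

The separable design keeps the spatial and temporal correlations in distinct attention blocks rather than flattening all tokens into a single cross-attention, which is what distinguishes this layout from a joint spatio-temporal fusion.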
Acknowledgements
Portions of this work were supported in part by a gift from Meta and a grant from the Toyota Research Institute University 2.0 program. The second author is supported by an NSF Graduate Research Fellowship.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lai, B., Ryan, F., Jia, W., Liu, M., Rehg, J.M. (2025). Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_11
DOI: https://doi.org/10.1007/978-3-031-72673-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72672-9
Online ISBN: 978-3-031-72673-6
eBook Packages: Computer Science, Computer Science (R0)