DOI: 10.1145/3664647.3681315
research-article

ScanTD: 360° Scanpath Prediction based on Time-Series Diffusion

Published: 28 October 2024

Abstract

Scanpath generation in 360° images aims to model the realistic trajectories of gaze points that viewers follow when exploring panoramic environments. Existing methods for scanpath generation suffer from various limitations, including a lack of global attention to panoramic environments, insufficient diversity in generated scanpaths, and inadequate consideration of the temporal sequence of gaze points. To address these challenges, we propose a novel approach, named ScanTD, which employs a conditional Diffusion Model-based method to generate multiple scanpaths. Notably, a transformer-based time-series (TTS) module with a novel attention mechanism is integrated into ScanTD to capture the temporal dependency of gaze points effectively. Additionally, ScanTD utilizes a Vision Transformer-based method for image feature extraction, enabling better learning of scene semantic information. Experimental results demonstrate that our approach outperforms state-of-the-art methods across three datasets. We further demonstrate its generalizability by applying it to the 360° saliency detection task.
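The conditional diffusion approach the abstract describes can be illustrated with a minimal sketch of reverse-diffusion sampling for a gaze trajectory. This is not the ScanTD implementation: the paper's denoiser is a transformer-based time-series module conditioned on ViT image features, whereas here the names `make_schedule`, `sample_scanpath`, and the stub `zero_denoiser` are hypothetical stand-ins used only to show the standard DDPM sampling loop over a sequence of gaze points.

```python
import numpy as np

def make_schedule(n_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule, as in standard DDPMs."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample_scanpath(denoiser, cond, seq_len=30, n_steps=50, rng=None):
    """Reverse diffusion: start from Gaussian noise and iteratively
    denoise a (seq_len, 2) trajectory of normalized gaze coordinates,
    conditioned on image features `cond`."""
    rng = rng or np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule(n_steps)
    x = rng.standard_normal((seq_len, 2))       # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, t, cond)          # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise    # DDPM reverse step
    return x

# Stub denoiser (predicts zero noise); a real model would be a trained
# conditional network such as the paper's TTS module.
zero_denoiser = lambda x, t, cond: np.zeros_like(x)
path = sample_scanpath(zero_denoiser, cond=None)
print(path.shape)  # (30, 2): one gaze point (2 coords) per time step
```

Because the model samples from noise, repeated calls with different seeds yield different plausible trajectories for the same image condition, which is how a diffusion formulation naturally addresses the diversity limitation the abstract mentions.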


Cited By

  • Robust 360° Visual Tracking with Dynamic Gnomonic Projection. 2024 39th International Conference on Image and Vision Computing New Zealand (IVCNZ), 1-6. DOI: 10.1109/IVCNZ64857.2024.10794447. Online publication date: 4 Dec 2024.

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 360 images
  2. diffusion model
  3. scanpath
  4. vision transformer

Qualifiers

  • Research-article

Funding Sources

  • the Royal Society of New Zealand

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
