DOI: 10.1145/3664647.3681315
research-article

ScanTD: 360° Scanpath Prediction based on Time-Series Diffusion

Published: 28 October 2024

Abstract

Scanpath generation in 360° images aims to model the realistic trajectories of gaze points that viewers follow when exploring panoramic environments. Existing methods for scanpath generation suffer from various limitations, including a lack of global attention to panoramic environments, insufficient diversity in generated scanpaths, and inadequate consideration of the temporal sequence of gaze points. To address these challenges, we propose a novel approach, named ScanTD, which employs a conditional Diffusion Model-based method to generate multiple scanpaths. Notably, a transformer-based time-series (TTS) module with a novel attention mechanism is integrated into ScanTD to capture the temporal dependency of gaze points effectively. Additionally, ScanTD utilizes a Vision Transformer-based method for image feature extraction, enabling better learning of scene semantic information. Experimental results demonstrate that our approach outperforms state-of-the-art methods across three datasets. We further demonstrate its generalizability by applying it to the 360° saliency detection task.
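The conditional diffusion approach the abstract describes can be illustrated with a minimal sketch of reverse-diffusion sampling for a gaze trajectory. This is not the ScanTD implementation: the paper's denoiser is a transformer-based time-series module conditioned on ViT image features, whereas here the names `make_schedule`, `sample_scanpath`, and the stub `zero_denoiser` are hypothetical stand-ins used only to show the standard DDPM sampling loop over a sequence of gaze points.

```python
import numpy as np

def make_schedule(n_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule, as in standard DDPMs."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample_scanpath(denoiser, cond, seq_len=30, n_steps=50, rng=None):
    """Reverse diffusion: start from Gaussian noise and iteratively
    denoise a (seq_len, 2) trajectory of normalized gaze coordinates,
    conditioned on image features `cond`."""
    rng = rng or np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule(n_steps)
    x = rng.standard_normal((seq_len, 2))       # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, t, cond)          # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise    # DDPM reverse step
    return x

# Stub denoiser (predicts zero noise); a real model would be a trained
# conditional network such as the paper's TTS module.
zero_denoiser = lambda x, t, cond: np.zeros_like(x)
path = sample_scanpath(zero_denoiser, cond=None)
print(path.shape)  # (30, 2): one gaze point (2 coords) per time step
```

Because the model samples from noise, repeated calls with different seeds yield different plausible trajectories for the same image condition, which is how a diffusion formulation naturally addresses the diversity limitation the abstract mentions.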


Cited By

  • Robust 360° Visual Tracking with Dynamic Gnomonic Projection. 2024 39th International Conference on Image and Vision Computing New Zealand (IVCNZ), 1-6. DOI: 10.1109/IVCNZ64857.2024.10794447. Online publication date: 4 Dec 2024.

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 360 images
  2. diffusion model
  3. scanpath
  4. vision transformer

Qualifiers

  • Research-article

Funding Sources

  • the Royal Society of New Zealand

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
