DOI: 10.1145/3607822.3616408

Simulating Human Visual System Based on Vision Transformer

Published: 13 October 2023

Abstract

The human visual system (HVS) responds in real time to complex visual environments. Scanpath prediction is the task of predicting the eye movements and visual fixations a person makes while freely observing a visual scene, and thus amounts to simulating the HVS. In this paper, we propose a Vision Transformer-based model to study the attentional processes of the HVS in analyzing visual scenes, thereby achieving scanpath prediction. This technology has important applications in human-computer interaction, virtual reality, augmented reality, and other fields. We significantly simplify both the scanpath prediction workflow and the overall model architecture, achieving performance superior to that of existing methods.
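
The abstract gives only a high-level description of the model. To make the idea concrete, here is a minimal PyTorch sketch of one plausible way to pair a Vision Transformer encoder with an autoregressive Transformer decoder for scanpath prediction. The patch size, embedding width, layer counts, and the (x, y, duration) regression head are all illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch: ViT encoder + autoregressive Transformer decoder
# for scanpath prediction. All hyperparameters and the (x, y, duration)
# output head are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class ScanpathTransformer(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, heads=8,
                 depth=4, max_fixations=16):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # ViT-style encoder: embed non-overlapping image patches as tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Decoder: each step attends to the image tokens and, causally,
        # to the fixations observed or predicted so far.
        self.fix_embed = nn.Linear(3, dim)   # (x, y, duration) -> token
        self.step_pos = nn.Parameter(torch.zeros(1, max_fixations, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.head = nn.Linear(dim, 3)        # regress the next fixation

    def forward(self, img, fixations):
        # img: (B, 3, H, W); fixations: (B, T, 3), values normalized to [0, 1]
        mem = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos_embed
        mem = self.encoder(mem)
        t = fixations.size(1)
        tgt = self.fix_embed(fixations) + self.step_pos[:, :t]
        # Causal mask: step i may not attend to later fixations j > i.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, mem, tgt_mask=causal)
        return torch.sigmoid(self.head(out))  # next (x, y, duration) per step

model = ScanpathTransformer()
img = torch.randn(2, 3, 224, 224)             # a batch of two images
seen = torch.rand(2, 5, 3)                    # five observed fixations each
print(model(img, seen).shape)                 # torch.Size([2, 5, 3])
```

At inference time such a model would be rolled out autoregressively: starting from an initial fixation (for example, the image center), each predicted (x, y, duration) triple is appended to the input sequence and fed back in until the desired scanpath length is reached.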


Cited By

  • (2024) Beyond Average: Individualized Visual Scanpath Prediction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25420-25431. DOI: 10.1109/CVPR52733.2024.02402. Online publication date: 16-Jun-2024.
  • (2024) GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths. In Computer Vision – ECCV 2024, 314-333. DOI: 10.1007/978-3-031-73242-3_18. Online publication date: 29-Sep-2024.



Published In

SUI '23: Proceedings of the 2023 ACM Symposium on Spatial User Interaction
October 2023
505 pages
ISBN: 9798400702815
DOI: 10.1145/3607822
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States




Author Tags

  1. Visual scanpath prediction
  2. fixation duration prediction
  3. saccade sequences
  4. scene analysis
  5. visual attention

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SUI '23: ACM Symposium on Spatial User Interaction
October 13 - 15, 2023
Sydney, NSW, Australia

Acceptance Rates

Overall acceptance rate: 86 of 279 submissions (31%)



Article Metrics

  • Downloads (last 12 months): 50
  • Downloads (last 6 weeks): 14

Reflects downloads up to 08 Mar 2025
