DOI: 10.1145/3607822.3616408

Simulating Human Visual System Based on Vision Transformer

Published: 13 October 2023

Abstract

The human visual system (HVS) responds in real time to complex visual environments. Scanpath prediction is the task of predicting the eye movements and visual fixations a person makes while freely observing a visual scene, and thus amounts to simulating the HVS. In this paper, we propose a Vision Transformer-based model to study the attentional processes of the HVS in analyzing visual scenes, thereby achieving scanpath prediction. This technology has important applications in human-computer interaction, virtual reality, augmented reality, and other fields. We significantly simplify both the scanpath prediction workflow and the overall model architecture, achieving performance superior to that of existing methods.
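
The abstract gives only a high-level description of the model. To make the idea concrete, here is a minimal PyTorch sketch of one plausible way to pair a Vision Transformer encoder with an autoregressive Transformer decoder for scanpath prediction. The patch size, embedding width, layer counts, and the (x, y, duration) regression head are all illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch: ViT encoder + autoregressive Transformer decoder
# for scanpath prediction. All hyperparameters and the (x, y, duration)
# output head are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class ScanpathTransformer(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, heads=8,
                 depth=4, max_fixations=16):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # ViT-style encoder: embed non-overlapping image patches as tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Decoder: each step attends to the image tokens and, causally,
        # to the fixations observed or predicted so far.
        self.fix_embed = nn.Linear(3, dim)   # (x, y, duration) -> token
        self.step_pos = nn.Parameter(torch.zeros(1, max_fixations, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.head = nn.Linear(dim, 3)        # regress the next fixation

    def forward(self, img, fixations):
        # img: (B, 3, H, W); fixations: (B, T, 3), values normalized to [0, 1]
        mem = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos_embed
        mem = self.encoder(mem)
        t = fixations.size(1)
        tgt = self.fix_embed(fixations) + self.step_pos[:, :t]
        # Causal mask: step i may not attend to later fixations j > i.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, mem, tgt_mask=causal)
        return torch.sigmoid(self.head(out))  # next (x, y, duration) per step

model = ScanpathTransformer()
img = torch.randn(2, 3, 224, 224)             # a batch of two images
seen = torch.rand(2, 5, 3)                    # five observed fixations each
print(model(img, seen).shape)                 # torch.Size([2, 5, 3])
```

At inference time such a model would be rolled out autoregressively: starting from an initial fixation (for example, the image center), each predicted (x, y, duration) triple is appended to the input sequence and fed back in until the desired scanpath length is reached.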


Cited By

  • (2024) Beyond Average: Individualized Visual Scanpath Prediction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25420-25431. DOI: 10.1109/CVPR52733.2024.02402. Online publication date: 16-Jun-2024.
  • (2024) GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths. In Computer Vision – ECCV 2024, 314-333. DOI: 10.1007/978-3-031-73242-3_18. Online publication date: 29-Sep-2024.



Published In

SUI '23: Proceedings of the 2023 ACM Symposium on Spatial User Interaction
October 2023
505 pages
ISBN: 9798400702815
DOI: 10.1145/3607822
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States




Author Tags

  1. Visual scanpath prediction
  2. fixation duration prediction
  3. saccade sequences
  4. scene analysis
  5. visual attention

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SUI '23: ACM Symposium on Spatial User Interaction
October 13 - 15, 2023
Sydney, NSW, Australia

Acceptance Rates

Overall acceptance rate: 86 of 279 submissions (31%)



Article Metrics

  • Downloads (last 12 months): 50
  • Downloads (last 6 weeks): 14

Reflects downloads up to 08 Mar 2025
