Abstract
Augmented reality (AR) overlays digital content onto the real world. In an AR system, accurate and precise estimates of users' visual fixations and head movements can enhance the quality of experience by allocating more computational resources to the analysis, rendering, and 3D registration of the areas of interest. However, little research has been devoted to understanding how users visually explore AR systems or to modeling visual attention in AR. To bridge the gap between saliency prediction on real-world scenes and on scenes augmented with virtual information, we construct the ARVR saliency dataset. Virtual reality (VR) technology is employed to simulate the real world, and object recognition and tracking annotations are blended into omnidirectional videos as the augmented content. Saliency annotations of head and eye movements for both the original and the augmented videos are collected and together constitute the ARVR dataset. We also design a model for saliency prediction in AR. Local block images are extracted to simulate the viewport and offset the projection distortion; conspicuous visual cues in these blocks form the spatial features, and optical flow is estimated as an important temporal feature. We further consider the interplay between virtual information and reality: the composition of the augmented information is distinguished, and the joint effects of adversarial augmentation and complementary augmentation are estimated. A Markov chain is constructed with the block images as graph nodes; the edge weights account for both the characteristics of viewing behavior and visual saliency mechanisms. The order of importance of the block images is then estimated from the equilibrium state of the Markov chain. Extensive experiments demonstrate the effectiveness of the proposed method.
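To make the last step concrete, below is a minimal sketch (not the authors' implementation) of how the equilibrium state of such a Markov chain could be computed by power iteration, assuming the non-negative edge weights between block images have already been derived from the spatial and temporal features; the names `block_importance` and `weights` are hypothetical.

```python
import numpy as np

def block_importance(weights: np.ndarray, tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Rank block images by the stationary distribution of a Markov chain.

    weights[i, j] is a hypothetical non-negative edge weight between block
    images i and j (e.g., combining saliency cues and viewing-behavior
    characteristics); rows are normalized into transition probabilities.
    """
    # Row-normalize the weight matrix into a stochastic transition matrix.
    transition = weights / weights.sum(axis=1, keepdims=True)
    n = transition.shape[0]
    pi = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ transition  # one step of the chain: pi' = pi P
        if np.abs(nxt - pi).sum() < tol:
            return nxt  # equilibrium reached: pi = pi P
        pi = nxt
    return pi

# Example: four block images with random positive edge weights.
rng = np.random.default_rng(0)
w = rng.random((4, 4)) + 1e-3
print(np.argsort(-block_importance(w)))  # blocks ordered by importance
```

With strictly positive weights the chain is irreducible and aperiodic, so by the Perron-Frobenius theorem the iteration converges to a unique stationary distribution; blocks carrying more stationary mass are ranked as more important.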