Panoramic Vision Transformer for Saliency Detection in 360$^\circ$ Videos

Yun, Heeseung; Lee, Sehun; Kim, Gunhee

doi:10.1007/978-3-031-19833-5_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13695)

Included in the following conference series:

European Conference on Computer Vision

Abstract

360$^\circ$ video saliency detection is one of the challenging benchmarks for 360$^\circ$ video understanding, since non-negligible distortion and discontinuity occur in every projection format of 360$^\circ$ videos, and the capture-worthy viewpoint on the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using a Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning, but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.
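
The projection distortion and discontinuity mentioned in the abstract can be made concrete for the common equirectangular format. The following is a minimal sketch and not the authors' code: it assumes an equirectangular frame, and the function names `equirect_row_weights`, `horizontal_stretch`, and `wrap_column` are hypothetical, introduced only for illustration. It shows that rows near the poles cover far less of the sphere than rows near the equator, and that the left and right image borders are neighbors on the sphere.

```python
import math


def equirect_row_weights(height):
    """Relative sphere area covered by one pixel in each row of an
    equirectangular frame (proportional to the cosine of the latitude)."""
    weights = []
    for row in range(height):
        # Latitude of the row center, from +pi/2 (top) down to -pi/2 (bottom).
        lat = math.pi / 2 - (row + 0.5) * math.pi / height
        weights.append(math.cos(lat))
    return weights


def horizontal_stretch(height):
    """Factor by which each row is stretched horizontally relative to the
    equator; it grows without bound toward the poles."""
    return [1.0 / w for w in equirect_row_weights(height)]


def wrap_column(col, width):
    """Column index after wrapping across the left/right seam, since
    longitude is periodic even though the image has hard borders."""
    return col % width


if __name__ == "__main__":
    weights = equirect_row_weights(4)
    print([round(w, 4) for w in weights])
    print([round(s, 4) for s in horizontal_stretch(4)])
    print(wrap_column(-1, 8), wrap_column(8, 8))
```

A model that treats the frame as an ordinary image ignores both effects, which is the motivation the abstract gives for a geometry-aware encoder.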

Notes

1. https://github.com/Samsung/360tools

Acknowledgement

We thank Youngjae Yu, Sangho Lee, and Joonil Na for their constructive comments. This work was supported by AIRS Company in Hyundai Motor Company & Kia Corporation through HKMC-SNU AI Consortium Fund and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01309, No. 2019-0-01082).

Author information

Authors and Affiliations

Seoul National University, Seoul, Korea
Heeseung Yun, Sehun Lee & Gunhee Kim

Corresponding author

Correspondence to Gunhee Kim.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1001 KB)

About this paper

Cite this paper

Yun, H., Lee, S., Kim, G. (2022). Panoramic Vision Transformer for Saliency Detection in 360$^\circ$ Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_25

DOI: https://doi.org/10.1007/978-3-031-19833-5_25
Published: 04 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5
eBook Packages: Computer Science, Computer Science (R0)
