
Panoramic Vision Transformer for Saliency Detection in 360° Videos

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13695))


Abstract

Saliency detection is a challenging benchmark for 360° video understanding, for two reasons: every projection format for 360° video introduces non-negligible distortion and discontinuity, and the capture-worthy viewpoint on the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder as a Vision Transformer with deformable convolution. This design lets us plug models pretrained on normal videos into our architecture without additional modules or finetuning, and it requires the geometric approximation to be performed only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features. It outperforms state-of-the-art models on the Wild360 benchmark by large margins, without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.
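The distortion and discontinuity mentioned in the abstract can be made concrete with a small sketch. It assumes the equirectangular format (one common 360° projection; the abstract does not single out a format) and is not taken from PAVER itself. The function names `erp_row_weights` and `wraparound_gap` are hypothetical, chosen only for illustration.

```python
import math


def erp_row_weights(height):
    """Relative sphere area covered by one pixel in each row of an
    equirectangular frame (assumed format, for illustration only)."""
    weights = []
    for row in range(height):
        # Latitude of the row centre: +pi/2 at the top, -pi/2 at the bottom.
        lat = (0.5 - (row + 0.5) / height) * math.pi
        # A pixel's footprint on the sphere shrinks with cos(latitude),
        # so rows near the poles are heavily stretched in the image.
        weights.append(math.cos(lat))
    return weights


def wraparound_gap(x_a, x_b, width):
    """Horizontal pixel distance between two columns when the left and
    right image borders are treated as adjacent on the sphere."""
    direct = abs(x_a - x_b)
    return min(direct, width - direct)


if __name__ == "__main__":
    # Distortion: polar rows carry far less sphere area than equatorial rows.
    print([round(w, 3) for w in erp_row_weights(8)])
    # Discontinuity: columns 0 and 1919 look far apart in the image
    # but are neighbours on the sphere.
    print(wraparound_gap(0, 1919, 1920))
```

The row weights show why a filter applied uniformly over the image treats polar and equatorial content inconsistently, and the wrap-around gap shows why the left and right borders cannot be treated as ordinary image edges.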


Notes

  1. https://github.com/Samsung/360tools


Acknowledgement

We thank Youngjae Yu, Sangho Lee, and Joonil Na for their constructive comments. This work was supported by AIRS Company in Hyundai Motor Company & Kia Corporation through HKMC-SNU AI Consortium Fund and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01309, No. 2019-0-01082).

Author information


Corresponding author

Correspondence to Gunhee Kim.




Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yun, H., Lee, S., Kim, G. (2022). Panoramic Vision Transformer for Saliency Detection in 360° Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_25


  • DOI: https://doi.org/10.1007/978-3-031-19833-5_25


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science, Computer Science (R0)
