Abstract
Panoramic images capture a 360\(^\circ \) field of view (FoV) and encompass the omnidirectional spatial information crucial for scene understanding. However, obtaining densely annotated panoramas in quantities sufficient for training is costly, and models trained in a closed-vocabulary setting are restricted in their applications. To tackle these problems, we define a new task termed Open Panoramic Segmentation (OPS): models are trained on FoV-restricted pinhole images in the source domain under an open-vocabulary setting and evaluated on FoV-open panoramic images in the target domain, endowing models with zero-shot open panoramic semantic segmentation ability. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance distortion-aware modeling from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP), specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, notably \({\textbf {+2.2}}\boldsymbol{\%}\) mIoU on outdoor WildPASS and \({\textbf {+2.4}}\boldsymbol{\%}\) mIoU on indoor Stanford2D3D. The source code is publicly available at OPS.
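The abstract does not detail how RERP works, but its name suggests warping pinhole training images with an equirectangular-style distortion so that the model encounters panorama-like deformations before ever seeing a real panorama. Below is a minimal sketch of one plausible reading, assuming a per-row cos(latitude) horizontal compression over a randomly chosen latitude band; the function name `random_erp_augment` and all parameters are hypothetical, not taken from the paper.

```python
import numpy as np

def random_erp_augment(img, rng=None):
    """Warp a pinhole image with an equirectangular-style distortion.

    Each image row is assigned a latitude inside a randomly placed band;
    horizontal coordinates are then compressed by cos(latitude), mimicking
    the stretching that equirectangular projection applies near the poles
    of a panorama. Nearest-neighbor sampling keeps labels valid if the
    same warp is applied to a segmentation mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # Random latitude band spanning a quarter sphere below/above the equator.
    lat0 = rng.uniform(-np.pi / 4, 0.0)
    lat1 = lat0 + np.pi / 2
    out = np.empty_like(img)
    for y in range(h):
        lat = lat0 + (lat1 - lat0) * y / max(h - 1, 1)
        scale = np.cos(lat)  # rows farther from the equator shrink more
        # Sample source columns compressed toward the image centre.
        xs = (np.arange(w) - w / 2) * scale + w / 2
        xs = np.clip(np.round(xs).astype(int), 0, w - 1)
        out[y] = img[y, xs]
    return out
```

Applying the same index map `xs` to the ground-truth mask would keep image and labels aligned, which is the usual requirement for geometric augmentations in semantic segmentation.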
Acknowledgements
This work was supported in part by the Ministry of Science, Research and the Arts of Baden-Württemberg (MWK) through the Cooperative Graduate School Accessibility through AI-based Assistive Technology (KATE) under Grant BW6-03, in part by BMBF through a fellowship within the IFI programme of DAAD, in part by the Helmholtz Association Initiative and Networking Fund on the HAICORE@KIT and HOREKA@KIT partition, in part by the National Key R&D Program under Grant 2022YFB4701400, and in part by Hangzhou SurImage Technology Company Ltd.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, J. et al. (2025). Open Panoramic Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_10
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1