Abstract
Perceiving the world as 3D occupancy helps embodied agents avoid collisions with obstacles of any type. While open-vocabulary image understanding has prospered recently, how to bind predicted 3D occupancy grids with open-world semantics remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift semantics into 3D space and thereby fulfill open-vocabulary 3D occupancy prediction. However, building upon these foundation models is not trivial. First, MiDaS faces a depth ambiguity problem: it only produces relative depth and cannot provide the metric depth required for feature lifting. Second, CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open-vocabulary recognition often suffers from the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN, which not only assembles but also adapts these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while preserving its beneficial depth prior. Then, a lightweight side adapter network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class reweighting strategy that gives priority to tail classes. With only 46M trainable parameters and no manual semantic labels, VEON achieves 15.14 mIoU on Occ3D-nuScenes and recognizes objects from open-vocabulary categories, showing that it is label-efficient, parameter-efficient, and sufficiently accurate.
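The abstract is compact, so a minimal PyTorch sketch of two of the ingredients it names may help: wrapping a frozen backbone layer with low-rank adaptation (LoRA) so the pretrained depth prior is preserved, and reweighting the semantic loss toward tail classes. All names below (`LoRALinear`, `class_reweighted_ce`, `tail_boost`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: LoRA wrapping of a frozen layer + tail-class reweighted loss.
# Hypothetical names; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # keep the pretrained depth prior frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # residual starts at zero (identity update)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def class_reweighted_ce(logits: torch.Tensor,
                        target: torch.Tensor,
                        class_freq: torch.Tensor,
                        tail_boost: float = 0.5) -> torch.Tensor:
    """Cross-entropy whose per-class weights grow as class frequency shrinks,
    giving priority to tail classes (one plausible reweighting scheme)."""
    weights = (class_freq.sum() / class_freq.clamp(min=1)) ** tail_boost
    weights = weights / weights.mean()            # normalize weights around 1
    return F.cross_entropy(logits, target, weight=weights)


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(256, 256), rank=8)
    feats = torch.randn(4, 256)
    print(layer(feats).shape)                     # torch.Size([4, 256])

    logits = torch.randn(10, 5)                   # 10 voxels, 5 semantic classes
    target = torch.randint(0, 5, (10,))
    freq = torch.tensor([1000., 500., 50., 10., 2.])   # imbalanced class counts
    print(class_reweighted_ce(logits, target, freq))
```

In the paper the adaptation acts on the MiDaS/ZoeDepth branch and the reweighting acts on the voxel-wise semantic loss; the inverse-frequency heuristic above is only one plausible instantiation of the reweighting strategy described in the abstract.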
References
The bevdet codebase. https://github.com/HuangJunJie2017/BEVDet. Accessed 28 Oct 2023
CVPR 2023 3D occupancy prediction challenge. https://github.com/CVPR2023-3D-Occupancy-Prediction/CVPR2023-3D-Occupancy-Prediction. Accessed 28 Oct 2023
Bao, H., Dong, L., Piao, S., Wei, F.: Beit: BERT pre-training of image transformers. In: ICLR (2021)
Behley, J., et al.: Semantickitti: a dataset for semantic scene understanding of lidar sequences. In: ICCV, pp. 9297–9307 (2019)
Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: depth estimation using adaptive bins. In: CVPR, pp. 4009–4018 (2021)
Bhat, S.F., Alhashim, I., Wonka, P.: Localbins: improving depth estimation by learning local distributions. In: ECCV, pp. 480–496 (2022)
Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)
Birkl, R., Wofk, D., Müller, M.: MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
Cao, A.Q., de Charette, R.: Monoscene: monocular 3D semantic scene completion. In: CVPR, pp. 3991–4001 (2022)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV, pp. 9650–9660 (2021)
Ding, R., Yang, J., Xue, C., Zhang, W., Bai, S., Qi, X.: PLA: language-driven open-vocabulary 3D scene understanding. In: CVPR, pp. 7010–7019 (2023)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS, pp. 2366–2374 (2014)
Fong, W.K., et al.: Panoptic nuscenes: a large-scale benchmark for lidar panoptic segmentation and tracking. RA-L 7(2), 3795–3802 (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Hu, E.J., et al.: Lora: low-rank adaptation of large language models. In: ICLR (2021)
Hu, Y., et al.: Planning-oriented autonomous driving. In: CVPR, pp. 17853–17862 (2023)
Huang, J., Huang, G.: Bevdet4d: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G.: Bevpoolv2: a cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111 (2022)
Huang, J., Huang, G., Zhu, Z., Yun, Y., Du, D.: Bevdet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., Yang, R.: The apolloscape open dataset for autonomous driving and its application. TPAMI 42(10), 2702–2719 (2019)
Huang, Y., Zheng, W., Zhang, B., Zhou, J., Lu, J.: Selfocc: self-supervised vision-based 3D occupancy prediction. arXiv preprint arXiv:2311.12754 (2023)
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision-based 3D semantic occupancy prediction. In: CVPR, pp. 9223–9232 (2023)
Huang, Z., Wu, X., Chen, X., Zhao, H., Zhu, L., Lasenby, J.: Openins3d: snap and lookup for 3D open-vocabulary instance segmentation. arXiv preprint arXiv:2309.00616 (2023)
Jiang, B., et al.: VAD: vectorized scene representation for efficient autonomous driving. In: ICCV, pp. 8340–8350 (2023)
Li, Y., et al.: Voxformer: sparse voxel transformer for camera-based 3D semantic scene completion. In: CVPR, pp. 9087–9098 (2023)
Li, Z., Snavely, N.: Megadepth: learning single-view depth prediction from internet photos. In: CVPR, pp. 2041–2050 (2018)
Li, Z., et al.: Bevformer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: ECCV, pp. 1–18 (2022)
Liu, K., et al.: Weakly supervised 3D open-vocabulary segmentation. arXiv preprint arXiv:2305.14093 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2018)
Lu, S., Chang, H., Jing, E.P., Boularias, A., Bekris, K.: OVIR-3D: open-vocabulary 3D instance retrieval without training on 3D data. CoRL (2023)
Miao, R., et al.: Occdepth: a depth-aware method for 3D semantic scene completion. arXiv preprint arXiv:2302.13540 (2023)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65(1), 99–106 (2021)
Peng, S., et al.: Openscene: 3D scene understanding with open vocabularies. In: CVPR, pp. 815–824 (2023)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: ECCV, pp. 194–210 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. TPAMI 44(3), 1623–1637 (2020)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV, pp. 746–760 (2012)
Sun, P., et al.: Scalability in perception for autonomous driving: waymo open dataset. In: CVPR, pp. 2446–2454 (2020)
Tan, Z., Dong, Z., Zhang, C., Zhang, W., Ji, H., Li, H.: OVO: open-vocabulary occupancy. arXiv preprint arXiv:2305.16133 (2023)
Tang, P., et al.: Sparseocc: rethinking sparse latent representation for vision-based semantic occupancy prediction. In: CVPR, pp. 15035–15044 (2024)
Tian, X., Jiang, T., Yun, L., Wang, Y., Wang, Y., Zhao, H.: OCC3D: a large-scale 3D occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365 (2023)
Tong, W., et al.: Scene as occupancy. In: ICCV, pp. 8406–8415 (2023)
Vobecky, A., et al.: POP-3D: open-vocabulary 3D occupancy prediction from images. In: NeurIPS, pp. 50545–50557 (2023)
Wang, G., et al.: Occgen: generative multi-modal 3D occupancy prediction for autonomous driving. arXiv preprint arXiv:2404.15014 (2024)
Wang, X., et al.: Openoccupancy: a large scale benchmark for surrounding semantic occupancy perception. In: ICCV, pp. 17850–17859 (2023)
Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Zhou, J., Lu, J.: Surroundocc: multi-camera 3D occupancy prediction for autonomous driving. In: ICCV, pp. 21729–21740 (2023)
Xian, K., Zhang, J., Wang, O., Mai, L., Lin, Z., Cao, Z.: Structure-guided ranking loss for single image depth prediction. In: CVPR, pp. 611–620 (2020)
Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In: CVPR, pp. 2945–2954 (2023)
Yao, Y., et al.: Blendedmvs: a large-scale dataset for generalized multi-view stereo networks. In: CVPR, pp. 1790–1799 (2020)
Zhang, C., et al.: Occnerf: self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243 (2023)
Zhang, Y., Zhu, Z., Du, D.: Occformer: dual-path transformer for vision-based 3D semantic occupancy prediction. arXiv preprint arXiv:2304.05316 (2023)
Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from clip. In: ECCV, pp. 696–712 (2022)
Acknowledgements
This work was supported by NSFC (62322113, 62376156), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, J. et al. (2025). VEON: Vocabulary-Enhanced Occupancy Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_6
DOI: https://doi.org/10.1007/978-3-031-72949-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72948-5
Online ISBN: 978-3-031-72949-2
eBook Packages: Computer Science, Computer Science (R0)