
VEON: Vocabulary-Enhanced Occupancy Prediction

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15112)


Abstract

Perceiving the world as 3D occupancy enables embodied agents to avoid collisions with obstacles of any type. While open-vocabulary image understanding has prospered recently, how to bind predicted 3D occupancy grids with open-world semantics remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift semantics into 3D space and thereby complete 3D occupancy prediction. However, building upon these foundation models is not trivial. First, MiDaS suffers from depth ambiguity: it produces only relative depth and cannot estimate the metric bin depth required for feature lifting. Second, CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open-vocabulary prediction is often hampered by the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN, which not only assembles but also adapts these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-to-metric bin-depth transformation while preserving its beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class reweighting strategy that gives priority to tail classes. With only 46M trainable parameters and zero manual semantic labels, VEON achieves 15.14 mIoU on Occ3D-nuScenes and can recognize objects from open-vocabulary categories, demonstrating that VEON is label-efficient, parameter-efficient, and accurate.
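The adaptation recipe in the abstract rests on two generic ingredients: low-rank adaptation of a frozen backbone and loss reweighting in favour of rare (tail) classes. The sketch below is a minimal, illustrative PyTorch rendering of both ideas; the rank, scaling, exponent gamma, and all helper names are assumptions made here for illustration and are not taken from the paper's implementation.

```python
# Illustrative sketch only: a generic LoRA adapter and inverse-frequency class
# reweighting. Hyperparameters and names are assumptions, not taken from VEON.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # pretrained weights stay fixed
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # low-rank branch starts at zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


def tail_class_weights(class_counts: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Give rare (tail) classes larger loss weights via inverse-frequency scaling."""
    freq = class_counts.float() / class_counts.sum()
    weights = freq.clamp_min(1e-12).pow(-gamma)  # rarer class -> larger weight
    return weights / weights.mean()              # normalize around 1 for stability


# Example usage: wrap one projection of a frozen encoder and build loss weights.
proj = LoRALinear(nn.Linear(768, 768))
weights = tail_class_weights(torch.tensor([100000, 5000, 120, 7]))
criterion = nn.CrossEntropyLoss(weight=weights)
```

The pattern keeps the pretrained weights intact while training only the small low-rank branches, which is how a handful of trainable parameters can adapt a large frozen model.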



Acknowledgements

This work was supported by NSFC (62322113, 62376156), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.

Author information


Corresponding author

Correspondence to Chao Ma.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1681 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, J. et al. (2025). VEON: Vocabulary-Enhanced Occupancy Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_6


  • DOI: https://doi.org/10.1007/978-3-031-72949-2_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72948-5

  • Online ISBN: 978-3-031-72949-2

  • eBook Packages: Computer Science; Computer Science (R0)
