OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Wang, Guoqing; Wang, Zhongdao; Tang, Pin; Zheng, Jilai; Ren, Xiangxuan; Feng, Bailan; Ma, Chao

doi:10.1007/978-3-031-72661-3_6

Guoqing Wang¹³,
Zhongdao Wang¹⁴,
Pin Tang¹³,
Jilai Zheng¹³,
Xiangxuan Ren¹³,
Bailan Feng¹⁴ &
…
Chao Ma¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15078))

Included in the following conference series:

European Conference on Computer Vision

280 Accesses

Abstract

Existing 3D semantic occupancy prediction methods typically treat the task as a one-shot 3D voxel-wise segmentation problem, focusing on a single-step mapping between the inputs and occupancy maps, which limits their ability to refine and complete local regions gradually. In this paper, we introduce OccGen, a simple yet powerful generative perception model for 3D semantic occupancy prediction. OccGen adopts a “noise-to-occupancy” generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on nuScenes-Occupancy dataset under the muli-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.99; Price excludes VAT (USA)

Softcover Book: USD 161.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding

Fully Sparse 3D Occupancy Prediction

References

Amit, T., Nachmani, E., Shaharbany, T., Wolf, L.: Segdiff: image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)
Behley, J., et al.: Semantickitti: a dataset for semantic scene understanding of lidar sequences. In: ICCV, pp. 9297–9307 (2019)
Google Scholar
Berman, M., Triki, A.R., Blaschko, M.B.: The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR, pp. 4413–4421 (2018)
Google Scholar
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
Google Scholar
Cao, A.Q., de Charette, R.: Monoscene: monocular 3D semantic scene completion. In: CVPR, pp. 3991–4001 (2022)
Google Scholar
Chen, S., Sun, P., Song, Y., Luo, P.: Diffusiondet: diffusion model for object detection. In: ICCV, pp. 19830–19843 (2023)
Google Scholar
Chen, T., Li, L., Saxena, S., Hinton, G., Fleet, D.J.: A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366 (2022)
Chen, X., Lin, K.Y., Qian, C., Zeng, G., Li, H.: 3D sketch-aware semantic scene completion via semi-supervised structure prior. In: CVPR, pp. 4193–4202 (2020)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Everingham, M., Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV (2009)
Google Scholar
Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS, vol. 27 (2014)
Google Scholar
Harakeh, A., Smart, M., Waslander, S.L.: Bayesod: a Bayesian approach for uncertainty estimation in deep object detectors. In: ICRA, pp. 87–93. IEEE (2020)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Google Scholar
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS, vol. 33, pp. 6840–6851 (2020)
Google Scholar
Hu, Y., et al.: Planning-oriented autonomous driving. In: CVPR, pp. 17853–17862 (2023)
Google Scholar
Huang, J., Huang, G., Zhu, Z., Du, D.: Bevdet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision-based 3D semantic occupancy prediction. In: CVPR, pp. 9223–9232 (2023)
Google Scholar
Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)
Ji, Y., et al.: DDP: diffusion model for dense visual prediction. In: ICCV, pp. 21741–21752 (2023)
Google Scholar
Jia, X., Gao, Y., Chen, L., Yan, J., Liu, P.L., Li, H.: Driveadapter: breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In: CVPR, pp. 7953–7963 (2023)
Google Scholar
Jiang, H., Cheng, T., Gao, N., Zhang, H., Liu, W., Wang, X.: Symphonize 3D semantic scene completion with contextual instance queries. In: CVPR, pp. 20258–20267 (2024)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: CVPR, pp. 12697–12705 (2019)
Google Scholar
Li, J., Han, K., Wang, P., Liu, Y., Yuan, X.: Anisotropic convolutional networks for 3D semantic scene completion. In: CVPR, pp. 3351–3359 (2020)
Google Scholar
Li, Y., et al.: Voxformer: sparse voxel transformer for camera-based 3D semantic scene completion. In: CVPR, pp. 9087–9098 (2023)
Google Scholar
Li, Y., et al.: Bevdepth: acquisition of reliable depth for multi-view 3D object detection. arXiv preprint arXiv:2206.10092 (2022)
Li, Z., et al.: Bevformer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13669, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Chapter Google Scholar
Liang, T., et al.: Bevfusion: a simple and robust lidar-camera fusion framework. In: NeurIPS (2022)
Google Scholar
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR, pp. 2117–2125 (2017)
Google Scholar
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: CVPR, pp. 10012–10022 (2021)
Google Scholar
Liu, Z., et al.: Bevfusion: multi-task multi-sensor fusion with unified bird’s-eye view representation. In: ICRA (2023)
Google Scholar
Loquercio, A., Segu, M., Scaramuzza, D.: A general framework for uncertainty estimation in deep learning. IEEE Robot. Autom. Lett. 5(2), 3153–3160 (2020)
Article Google Scholar
Lu, H., et al.: Scaling multi-camera 3D object detection through weak-to-strong eliciting. arXiv preprint arXiv:2404.06700 (2024)
Lu, H., Zhang, Y., Lian, Q., Du, D., Chen, Y.: Towards generalizable multi-camera 3D object detection via perspective debiasing. arXiv preprint arXiv:2310.11346 (2023)
Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: ICML, pp. 8162–8171. PMLR (2021)
Google Scholar
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Chapter Google Scholar
Roldao, L., de Charette, R., Verroust-Blondet, A.: Lmscnet: lightweight multiscale 3D semantic completion. In: 3DV (2020)
Google Scholar
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Chapter Google Scholar
Saharia, C., et al.: Palette: image-to-image diffusion models. In: SIGGRAPH, pp. 1–10 (2022)
Google Scholar
Saxena, S., Kar, A., Norouzi, M., Fleet, D.J.: Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816 (2023)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, pp. 2256–2265. PMLR (2015)
Google Scholar
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR, pp. 1746–1754 (2017)
Google Scholar
Tang, P., et al.: Sparseocc: rethinking sparse latent representation for vision-based semantic occupancy prediction. arXiv preprint arXiv:2404.09502 (2024)
Tian, X., Jiang, T., Yun, L., Wang, Y., Wang, Y., Zhao, H.: OCC3D: a large-scale 3D occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365 (2023)
Tong, W., et al.: Scene as occupancy. In: ICCV, pp. 8406–8415 (2023)
Google Scholar
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: sequential fusion for 3D object detection. In: CVPR, pp. 4604–4612 (2020)
Google Scholar
Wang, C., Ma, C., Zhu, M., Yang, X.: Pointaugmenting: cross-modal augmentation for 3D object detection. In: CVPR, pp. 11794–11803 (2021)
Google Scholar
Wang, X., et al.: Openoccupancy: a large scale benchmark for surrounding semantic occupancy perception. In: ICCV, pp. 17850–17859 (2023)
Google Scholar
Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Zhou, J., Lu, J.: Surroundocc: multi-camera 3D occupancy prediction for autonomous driving. In: ICCV, pp. 21729–21740 (2023)
Google Scholar
Wolleb, J., Sandkühler, R., Bieder, F., Valmaggia, P., Cattin, P.C.: Diffusion models for implicit image segmentation ensembles. In: MIDL, pp. 1336–1348 (2022)
Google Scholar
Wu, J., Fang, H., Zhang, Y., Yang, Y., Xu, Y.: Medsegdiff: medical image segmentation with diffusion probabilistic model. arXiv preprint arXiv:2211.00611 (2022)
Yan, X., et al.: Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In: AAAI, vol. 35, pp. 3101–3109 (2021)
Google Scholar
Yan, Y., Mao, Y., Li, B.: SECOND: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
Article Google Scholar
Zhang, Y., et al.: Polarnet: an improved grid representation for online lidar point clouds semantic segmentation. In: CVPR, pp. 9601–9610 (2020)
Google Scholar
Zhang, Y., Zhu, Z., Du, D.: Occformer: dual-path transformer for vision-based 3D semantic occupancy prediction. In: ICCV, pp. 9433–9443 (2023)
Google Scholar
Zhou, Y., Tuzel, O.: Voxelnet: end-to-end learning for point cloud based 3D object detection. In: CVPR, pp. 4490–4499 (2018)
Google Scholar
Zhu, X., et al.: Cylindrical and asymmetrical 3D convolution networks for lidar segmentation. In: CVPR, pp. 9939–9948 (2021)
Google Scholar
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021). https://openreview.net/forum?id=gZ9hCDWe6ke

Download references

Acknowledgements

This work was supported by NSFC (62322113, 62376156), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Guoqing Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren & Chao Ma
Huawei Noah’s Ark Lab, Beijing, China
Zhongdao Wang & Bailan Feng

Authors

Guoqing Wang
View author publications
You can also search for this author in PubMed Google Scholar
Zhongdao Wang
View author publications
You can also search for this author in PubMed Google Scholar
Pin Tang
View author publications
You can also search for this author in PubMed Google Scholar
Jilai Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Xiangxuan Ren
View author publications
You can also search for this author in PubMed Google Scholar
Bailan Feng
View author publications
You can also search for this author in PubMed Google Scholar
Chao Ma
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chao Ma .

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 20338 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, G. et al. (2025). OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_6

Download citation

DOI: https://doi.org/10.1007/978-3-031-72661-3_6
Published: 27 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving