Abstract
Existing benchmarks for 3D semantic occupancy prediction in autonomous driving are limited by low resolution (up to \([512\times 512\times 40]\) with 0.2 m voxel size) and inaccurate annotations, hindering the unification of 3D scene understanding through the occupancy representation. Moreover, previous methods can only generate occupancy predictions at 0.4 m resolution or lower, requiring post-upsampling to reach their full resolution (0.2 m). These limitations stem from the sparsity, noise, and outright errors in the raw data. In this paper, we overcome these challenges by introducing nuCraft, a high-resolution and accurate semantic occupancy dataset derived from nuScenes. nuCraft offers an \(8\times\) increase in resolution (\([1024\times 1024\times 80]\) with a voxel size of 0.1 m) and more precise semantic annotations than previous benchmarks. To address the high memory cost of high-resolution occupancy prediction, we propose VQ-Occ, a novel method that encodes occupancy data into a compact latent feature space using a VQ-VAE. This reduces semantic occupancy prediction to feature simulation in the VQ latent space, which is easier to learn and more memory-efficient. Our method directly generates semantic occupancy fields at high resolution without post-upsampling, facilitating a more unified approach to 3D scene understanding. Extensive experiments validate the superior quality of nuCraft and the effectiveness of VQ-Occ, demonstrating significant advances over existing benchmarks and methods.
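To make the compression idea behind VQ-Occ concrete, below is a minimal PyTorch sketch of the standard VQ-VAE quantization step (Van den Oord et al., 2017) applied to voxel features from a 3D occupancy encoder. The module name, codebook size, and tensor shapes are illustrative assumptions for exposition, not the paper's released implementation.

```python
# A minimal sketch of VQ-VAE vector quantization over 3D voxel features,
# assuming hypothetical dimensions; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment loss

    def forward(self, z):
        # z: encoder features for a 3D occupancy grid, shape (B, C, D, H, W)
        B, C, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, C)  # (B*D*H*W, C)
        # Nearest codebook entry per voxel feature (squared L2 distance).
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)  # discrete code index per voxel feature
        zq = self.codebook(idx).view(B, D, H, W, C).permute(0, 4, 1, 2, 3)
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        return zq, idx.view(B, D, H, W), loss
```

In a pipeline like the one the abstract describes, a perception network would then predict latent features (or code indices) in this quantized space, and a frozen VQ decoder would map them back to a full-resolution semantic occupancy grid, avoiding dense high-resolution supervision during prediction.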
References
Behley, J., et al.: SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9297–9307 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Cao, A.Q., de Charette, R.: MonoScene: monocular 3D semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3991–4001 (2022)
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
Cheng, R., Agia, C., Ren, Y., Li, X., Bingbing, L.: S3CNet: a sparse semantic scene completion network for LiDAR point clouds. In: Conference on Robot Learning, pp. 2148–2161. PMLR (2021)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883 (2021)
Fei, B., Yang, W., Chen, W.M., Ma, L.: VQ-DcTr: vector-quantized autoencoder with dual-channel transformer points splitting for 3D point cloud completion. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4769–4778 (2022)
Firman, M., Mac Aodha, O., Julier, S., Brostow, G.J.: Structured prediction of unobserved voxels from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5431–5440 (2016)
Fong, W.K., et al.: Panoptic nuScenes: a large-scale benchmark for LiDAR panoptic segmentation and tracking. IEEE Robot. Autom. Lett. 7(2), 3795–3802 (2022)
Hua, B.S., Pham, Q.H., Nguyen, D.T., Tran, M.K., Yu, L.F., Yeung, S.K.: SceneNN: a scene meshes dataset with annotations. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 92–101. IEEE (2016)
Huang, J., Gojcic, Z., Atzmon, M., Litany, O., Fidler, S., Williams, F.: Neural kernel surface reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4369–4379 (2023)
Huang, J., Huang, G.: BEVDet4D: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G., Zhu, Z., Ye, Y., Du, D.: BEVDet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Huang, Y., Zheng, W., Zhang, B., Zhou, J., Lu, J.: SelfOcc: self-supervised vision-based 3D occupancy prediction. arXiv preprint arXiv:2311.12754 (2023)
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision-based 3D semantic occupancy prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9223–9232 (2023)
Lee, J., Im, W., Lee, S., Yoon, S.E.: Diffusion probabilistic models for scene-scale 3D categorical data. arXiv preprint arXiv:2301.00527 (2023)
Li, Y., et al.: VoxFormer: sparse voxel transformer for camera-based 3D semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9087–9098 (2023)
Li, Y., Dou, Y., Chen, X., Ni, B., Sun, Y., Liu, Y., Wang, F.: Generalized deep 3D shape prior via part-discretized diffusion process. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16784–16794 (2023)
Mersch, B., Guadagnino, T., Chen, X., Vizzo, I., Behley, J., Stachniss, C.: Building volumetric beliefs for dynamic environments exploiting map-based moving object segmentation. IEEE Robot. Autom. Lett. 8, 5180–5187 (2023)
Pan, M., et al.: RenderOcc: vision-centric 3D occupancy prediction with 2D rendering supervision. arXiv preprint arXiv:2309.09502 (2023)
Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with VQ-VAE-2. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Roldao, L., de Charette, R., Verroust-Blondet, A.: LMSCNet: lightweight multiscale 3D semantic completion. In: 2020 International Conference on 3D Vision (3DV), pp. 111–119. IEEE (2020)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754 (2017)
Tian, X., et al.: Occ3D: a large-scale 3D occupancy prediction benchmark for autonomous driving. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Tong, W., et al.: Scene as occupancy. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8406–8415 (2023)
Van den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vizzo, I., Guadagnino, T., Mersch, B., Wiesmann, L., Behley, J., Stachniss, C.: KISS-ICP: in defense of point-to-point ICP – simple, accurate, and robust registration if done the right way. IEEE Robot. Autom. Lett. 8(2), 1029–1036 (2023). https://doi.org/10.1109/LRA.2023.3236571
Wang, X., et al.: OpenOccupancy: a large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991 (2023)
Xia, Z., et al.: SCPNet: semantic scene completion on point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17642–17651 (2023)
Xiao, J., Owens, A., Torralba, A.: SUN3D: a database of big spaces reconstructed using SFM and object labels. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1625–1632 (2013)
Xiong, Y., Ma, W.C., Wang, J., Urtasun, R.: Learning compact representations for LiDAR completion and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1074–1083 (2023)
Yan, J., et al.: Cross modal transformer: towards fast and robust 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18268–18278 (2023)
Yan, X., et al.: Sparse single sweep LiDAR point cloud segmentation via learning contextual shape priors from scene completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3101–3109 (2021)
Yu, Z., et al.: FlashOcc: fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058 (2023)
Zheng, W., Chen, W., Huang, Y., Zhang, B., Duan, Y., Lu, J.: OccWorld: learning a 3D occupancy world model for autonomous driving. arXiv preprint arXiv:2311.16038 (2023)
Zhong, X., Pan, Y., Behley, J., Stachniss, C.: SHINE-Mapping: large-scale 3D mapping using sparse hierarchical implicit neural representations. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 8371–8377. IEEE (2023)
Zhu, B., Wang, Z., Shi, S., Xu, H., Hong, L., Li, H.: ConQueR: query contrast voxel-DETR for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9296–9305 (2023)
Zuo, S., Zheng, W., Huang, Y., Zhou, J., Lu, J.: PointOcc: cylindrical tri-perspective view for point-based 3D semantic occupancy prediction. arXiv preprint arXiv:2308.16896 (2023)
Acknowledgements
This project is funded in part by the National Key R&D Program of China (Project 2022ZD0161100), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)'s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, B., Wang, Z., Li, H. (2025). nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15063. Springer, Cham. https://doi.org/10.1007/978-3-031-72652-1_8
DOI: https://doi.org/10.1007/978-3-031-72652-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72651-4
Online ISBN: 978-3-031-72652-1
eBook Packages: Computer Science; Computer Science (R0)