nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15063)

Included in the following conference series: ECCV: European Conference on Computer Vision

Abstract

Existing benchmarks for 3D semantic occupancy prediction in autonomous driving are limited by low resolution (up to 512×512×40 with 0.2 m voxel size) and inaccurate annotations, hindering the unification of 3D scene understanding through the occupancy representation. Moreover, previous methods can only generate occupancy predictions at 0.4 m resolution or lower, requiring post-upsampling to reach their full resolution (0.2 m). The root of these limitations lies in the sparsity, noise, and even errors present in the raw data. In this paper, we overcome these challenges by introducing nuCraft, a high-resolution and accurate semantic occupancy dataset derived from nuScenes. nuCraft offers an 8× increase in resolution (1024×1024×80 with voxel size of 0.1 m) and more precise semantic annotations compared to previous benchmarks. To address the high memory cost of high-resolution occupancy prediction, we propose VQ-Occ, a novel method that encodes occupancy data into a compact latent feature space using a VQ-VAE. This approach simplifies semantic occupancy prediction into feature simulation in the VQ latent space, making it easier and more memory-efficient. Our method enables direct generation of semantic occupancy fields at high resolution without post-upsampling, facilitating a more unified approach to 3D scene understanding. We validate the superior quality of nuCraft and the effectiveness of VQ-Occ through extensive experiments, demonstrating significant advancements over existing benchmarks and methods.
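
A note on the arithmetic behind the "8×" claim: 1024×1024×80 contains exactly eight times as many voxels as 512×512×40, since each dimension doubles as the voxel size halves from 0.2 m to 0.1 m. The paper's VQ-Occ implementation is not reproduced here, but the VQ-VAE operation it builds on is standard: encoder features are replaced by their nearest entries in a learned codebook, with a straight-through estimator so gradients still reach the encoder. The sketch below is a minimal PyTorch illustration of that step only; the function name, tensor shapes, and codebook size are illustrative assumptions, not details from the paper.

    import torch

    def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
        """Nearest-neighbour codebook lookup with a straight-through gradient.

        z_e:      (N, D) encoder features, one D-dim vector per latent voxel.
        codebook: (K, D) learned embedding vectors (sizes are hypothetical).
        """
        # Pairwise L2 distances between features and codebook entries: (N, K).
        dist = torch.cdist(z_e, codebook)
        indices = dist.argmin(dim=1)      # index of the nearest code, shape (N,)
        z_q = codebook[indices]           # quantized features, shape (N, D)
        # Straight-through estimator: the backward pass treats quantization
        # as the identity, so gradients flow to the encoder unchanged.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices

    # Toy usage (all sizes assumed): a 32 x 32 x 5 latent grid of 64-dim
    # features, quantized against a codebook of 512 entries.
    z_e = torch.randn(32 * 32 * 5, 64)
    codebook = torch.randn(512, 64)
    z_q, idx = vector_quantize(z_e, codebook)
    print(z_q.shape, idx.shape)  # torch.Size([5120, 64]) torch.Size([5120])

In a VQ-Occ-style pipeline, the memory saving comes from working in this latent space: the network predicts a small grid of codebook indices rather than the full 0.1 m voxel volume, and the decoder expands the result back to high resolution without post-upsampling.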

Acknowledgements

This project is funded in part by the National Key R&D Program of China (Project 2022ZD0161100), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd. under the Innovation and Technology Commission (ITC)’s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information

Corresponding author

Correspondence to Benjin Zhu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 36434 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, B., Wang, Z., Li, H. (2025). nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15063. Springer, Cham. https://doi.org/10.1007/978-3-031-72652-1_8

  • DOI: https://doi.org/10.1007/978-3-031-72652-1_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72651-4

  • Online ISBN: 978-3-031-72652-1

  • eBook Packages: Computer Science, Computer Science (R0)
