
OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

  • Conference paper
  • First Online:
Computer Vision – ACCV 2024 (ACCV 2024)

Abstract

3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation for processing 2D image features. However, depth estimation is an ill-posed problem, hindering the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth estimation free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance their performance. Experiments conducted on nuScenes-Occupancy and nuScenes-Occ3D demonstrate our framework’s superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.
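The abstract does not detail how image features are lifted to 3D without depth estimation, and the full paper is behind a paywall, so the sketch below is an illustration only, not the authors' OccFusion architecture. It shows one common depth-estimation-free fusion pattern: project the centers of a 3D voxel grid into the camera with known intrinsics and extrinsics, then sample 2D image features at the resulting pixels, so no per-pixel depth is ever predicted. All function names, and the nearest-neighbour sampling choice, are hypothetical.

```python
import numpy as np

def project_voxels_to_image(voxel_centers, K, T_cam_from_lidar):
    """Project (N, 3) voxel centers in the LiDAR frame to pixel coordinates.

    K is the 3x3 camera intrinsic matrix; T_cam_from_lidar is a 4x4
    rigid transform from the LiDAR frame to the camera frame.
    Returns (N, 2) pixel coordinates and a validity mask for points
    in front of the camera.
    """
    pts_h = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]       # points in camera frame
    valid = cam[:, 2] > 0.1                           # must lie in front of camera
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)  # perspective divide
    return uv, valid

def sample_image_features(feat_map, uv, valid):
    """Nearest-neighbour sample of an (H, W, C) feature map at pixel coords.

    Voxels projecting outside the image, or behind the camera, get zeros;
    the returned mask marks which voxels received image features.
    """
    H, W, C = feat_map.shape
    out = np.zeros((len(uv), C))
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out[inside] = feat_map[v[inside], u[inside]]
    return out, inside
```

The sampled per-voxel image features could then be concatenated or added to LiDAR voxel features before the occupancy head; real systems typically replace the nearest-neighbour lookup with bilinear or deformable-attention sampling over multi-scale feature maps.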

J. Zhang and Y. Ding contributed equally to this work.



Author information


Corresponding authors

Correspondence to Ji Zhang or Yiran Ding.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, J., Ding, Y., Liu, Z. (2025). OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15481. Springer, Singapore. https://doi.org/10.1007/978-981-96-0972-7_14


  • DOI: https://doi.org/10.1007/978-981-96-0972-7_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0971-0

  • Online ISBN: 978-981-96-0972-7

  • eBook Packages: Computer Science (R0)
