SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird’s Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which Simultaneously detects 2D objects in the Perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: https://github.com/nullmax-vision/SimPB.
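The abstract describes a cyclic 3D-2D-3D flow: 3D object queries are allocated to per-camera 2D queries, refined by multi-view 2D decoder layers with Query-group Attention, then aggregated back and refined by 3D decoder layers in BEV space. The sketch below illustrates only that control flow under stated assumptions; it is not the authors' implementation (see the linked repository), and all module names, shapes, and the allocation/aggregation rules are illustrative placeholders (e.g. allocation by simple broadcasting rather than geometric projection, aggregation by a mean over cameras).

```python
# Minimal conceptual sketch of the cyclic 3D-2D-3D hybrid decoder outlined in the
# abstract. NOT the authors' implementation; every module, shape, and rule here
# is an assumption made for illustration.
import torch
import torch.nn as nn


class HybridDecoderSketch(nn.Module):
    def __init__(self, embed_dim=256, num_cams=6, num_2d_layers=1, num_3d_layers=1):
        super().__init__()
        self.num_cams = num_cams
        # Multi-view 2D decoder layers: refine per-camera 2D queries against image tokens.
        self.dec2d = nn.ModuleList(
            nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
            for _ in range(num_2d_layers)
        )
        # 3D decoder layers: refine 3D object queries in BEV space.
        self.dec3d = nn.ModuleList(
            nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
            for _ in range(num_3d_layers)
        )
        # Stand-in for Adaptive Query Aggregation: fuse 2D queries back into 3D queries.
        self.aggregate = nn.Linear(embed_dim, embed_dim)

    def allocate(self, q3d):
        # Stand-in for Dynamic Query Allocation: every 3D query is copied to every
        # camera; the paper instead allocates queries to views dynamically.
        return q3d.unsqueeze(1).expand(-1, self.num_cams, -1, -1)  # (B, C, Q, D)

    def forward(self, q3d, img_feats):
        """q3d: (B, Q, D) 3D object queries; img_feats: (B, C, N, D) per-camera tokens."""
        b, c, n, d = img_feats.shape
        # 3D -> 2D: allocate 3D queries to camera views.
        q2d = self.allocate(q3d).reshape(b * c, -1, d)
        feats = img_feats.reshape(b * c, n, d)
        # Per-camera processing stands in for Query-group Attention within camera groups.
        for layer in self.dec2d:
            q2d = layer(q2d, feats)
        # 2D -> 3D: aggregate refined per-camera queries back into the 3D queries.
        q3d = q3d + self.aggregate(q2d.reshape(b, c, -1, d).mean(dim=1))
        # Final 3D refinement (self-attention over 3D queries only, in this sketch).
        for layer in self.dec3d:
            q3d = layer(q3d, q3d)
        return q3d


if __name__ == "__main__":
    model = HybridDecoderSketch()
    out = model(torch.randn(2, 50, 256), torch.randn(2, 6, 100, 256))
    print(out.shape)  # torch.Size([2, 50, 256])
```

Running the script confirms the cycle returns 3D queries in their original shape, so the allocation and aggregation steps can be stacked across several decoder layers.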

Y. Tang and Z. Meng—Equal contribution.



Author information


Corresponding author

Correspondence to Erkang Cheng.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9004 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tang, Y., Meng, Z., Chen, G., Cheng, E. (2025). SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_1


  • DOI: https://doi.org/10.1007/978-3-031-72627-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72626-2

  • Online ISBN: 978-3-031-72627-9

  • eBook Packages: Computer Science, Computer Science (R0)
