Abstract
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird’s Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which Simultaneously detects 2D objects in the Perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: https://github.com/nullmax-vision/SimPB.
Y. Tang and Z. Meng—Equal contribution.
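The abstract describes a cyclic 3D-2D-3D interaction: 3D queries are allocated to camera views, refined by multi-view 2D decoder layers with Query-group Attention, then aggregated back into the 3D queries. The toy sketch below illustrates that control flow only; all names, shapes, and operations (linear maps for allocation, a softmax self-mixing for attention, mean fusion for aggregation) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of SimPB's cyclic 3D-2D-3D hybrid decoder round.
# Shapes and module internals are assumptions chosen for readability.
rng = np.random.default_rng(0)
NUM_CAMS, NUM_Q, DIM = 6, 8, 16

# Stand-in per-camera projection weights (hypothetical).
cam_proj = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
            for _ in range(NUM_CAMS)]

def allocate(queries_3d):
    """Dynamic Query Allocation: distribute each 3D query to every camera
    view (a per-camera linear map stands in for geometric projection)."""
    return [queries_3d @ W for W in cam_proj]  # list of (NUM_Q, DIM)

def decode_2d(per_cam):
    """Multi-view 2D decoder layer: refine queries within each camera
    group (Query-group Attention reduced to softmax self-mixing)."""
    out = []
    for q in per_cam:
        attn = np.exp(q @ q.T / np.sqrt(DIM))
        attn /= attn.sum(axis=-1, keepdims=True)
        out.append(attn @ q)
    return out

def aggregate(per_cam, queries_3d):
    """Adaptive Query Aggregation: fuse refined per-view queries back
    into the 3D queries (mean fusion plus a residual connection)."""
    return queries_3d + np.mean(per_cam, axis=0)

queries_3d = rng.standard_normal((NUM_Q, DIM))
# One cyclic 3D -> 2D -> 3D round of the hybrid decoder.
queries_3d = aggregate(decode_2d(allocate(queries_3d)), queries_3d)
print(queries_3d.shape)  # (8, 16)
```

In the actual model, this cycle repeats across several 2D and 3D decoder layers so that 2D and 3D results continuously refine each other, rather than the 2D detections being consumed only once as in two-stage pipelines.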
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Tang, Y., Meng, Z., Chen, G., Cheng, E. (2025). SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_1
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9