Abstract
Bird's eye view (BEV) object detection from multiple cameras has become a mainstream paradigm in autonomous driving. However, many existing methods perform poorly when detecting extremely large or small objects. To address this issue, this paper proposes SS-BEV, a new framework for multi-camera BEV detection. First, we design a more expressive feature extraction module, A-WBFP, which integrates parallel atrous convolutions and weighted bidirectional feature pyramid blocks into the backbone network in a cascaded manner. This prevents the loss of small-object information in deeper network layers and enlarges the receptive field, producing feature maps enriched with contextual information. Second, we introduce the MORD module, which leverages the accurate depth information of radar point clouds to improve the model's spatial structure understanding for both large and small objects: it learns the relative depth between the internal structures of an object and selected reference points, and a corresponding loss function supervises the final detection. On the challenging nuScenes validation set, SS-BEV outperforms the baseline model by 2.1 points in NDS. On the nuScenes test set, it achieves 66.1% and 22.3% mAP on obstacles and construction vehicles, respectively, surpassing some methods based on multi-camera and LiDAR fusion.
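To make the A-WBFP design concrete, below is a minimal PyTorch sketch (not the authors' implementation) of its two named ingredients: a parallel atrous convolution block that enlarges the receptive field, and an EfficientDet-style weighted bidirectional fusion node. The class names, channel counts, and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAtrousBlock(nn.Module):
    """Parallel atrous (dilated) 3x3 convolutions, ASPP-style: each branch
    sees a different receptive field, and a 1x1 conv fuses the branches."""

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))


class WeightedFusion(nn.Module):
    """One weighted bidirectional feature pyramid (BiFPN) fusion node:
    learnable non-negative weights, normalized so they sum to one."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(w[i] * x for i, x in enumerate(inputs))


if __name__ == "__main__":
    block = ParallelAtrousBlock(64)
    fuse = WeightedFusion(num_inputs=2)
    shallow = torch.randn(1, 64, 32, 88)  # shallow backbone feature (small objects)
    deep = torch.randn(1, 64, 32, 88)     # deeper feature, already upsampled
    out = fuse([block(shallow), deep])    # cascade: atrous context, then weighted fusion
    print(out.shape)                      # torch.Size([1, 64, 32, 88])
```

The cascaded arrangement here (atrous context first, weighted fusion second) follows the abstract's wording; the actual module presumably wires several such nodes across pyramid levels.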
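The MORD supervision can be sketched in the same spirit. The abstract describes learning the relative depth between an object's internal structures and selected reference points, supervised by radar depth; the function below is a hypothetical rendering of such a loss, where the point sampling, the choice of reference point, and the signature are all assumptions rather than the paper's definition.

```python
import torch
import torch.nn.functional as F


def relative_depth_loss(pred_depth, radar_depth, valid_mask, ref_index=0):
    """Hypothetical relative-depth loss in the spirit of MORD.

    pred_depth:  (N,) predicted depths at N points sampled inside one object
    radar_depth: (N,) radar-measured depths at the same points (sparse truth)
    valid_mask:  (N,) bool, True where a radar return actually exists
    ref_index:   index of the reference point (e.g. the object centre)
    """
    if valid_mask.sum() == 0:
        return pred_depth.new_zeros(())  # no radar evidence for this object
    # Supervise depth *relative* to the reference point rather than absolute
    # depth, so the model learns the object's internal spatial structure.
    pred_rel = pred_depth - pred_depth[ref_index]
    gt_rel = radar_depth - radar_depth[ref_index]
    return F.smooth_l1_loss(pred_rel[valid_mask], gt_rel[valid_mask])


# Example: five sampled points on one object, reference = point 0.
pred = torch.tensor([10.2, 10.8, 11.1, 9.9, 10.5])
radar = torch.tensor([10.0, 11.0, 11.0, 10.0, 10.0])
mask = torch.tensor([True, True, False, True, True])
print(relative_depth_loss(pred, radar, mask, ref_index=0))
```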
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (2023CSJGG1600), the Natural Science Foundation of Anhui Province (2208085MF173), and the Wuhu "ChiZhu Light" Major Science and Technology Project (2023ZD01, 2023ZD03).
Author information
Authors and Affiliations
Contributions
P.S.: SS-BEV: multi-camera BEV object detection based on multi-scale spatial structure understanding. Y.P.: Software, methodology, resources, writing (original draft preparation), visualization, formal analysis, investigation. A.Y.: Supervision, validation.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shi, P., Pan, Y. & Yang, A. SS-BEV: multi-camera BEV object detection based on multi-scale spatial structure understanding. SIViP 19, 172 (2025). https://doi.org/10.1007/s11760-024-03762-5