Abstract
Bird's eye view (BEV) object detection from multiple cameras has become a mainstream paradigm in autonomous driving. However, many existing methods perform poorly when detecting extremely large or small objects. To address this issue, this paper proposes SS-BEV, a new framework for multi-camera BEV detection. First, we design a more expressive feature extraction module, A-WBFP, which integrates parallel atrous convolutions and weighted bidirectional feature pyramid blocks into the backbone network in a cascaded manner. This prevents the loss of small-object information in deeper network layers and enlarges the receptive field, producing feature maps enriched with contextual information. Second, we introduce the MORD module, which leverages the accurate depth information of radar point clouds to improve the model's spatial structure understanding for both large and small objects: it learns the relative depth between the internal structures of an object and selected reference points, and a corresponding loss function supervises the final detection. On the challenging nuScenes validation set, SS-BEV outperforms the baseline model by 2.1 points in NDS. On the nuScenes test set, it achieves 66.1% and 22.3% mAP on obstacles and construction vehicles, respectively, surpassing some methods based on multi-camera and LiDAR fusion.
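To make the A-WBFP design concrete, below is a minimal PyTorch sketch (not the authors' implementation) of its two named ingredients: a parallel atrous convolution block that enlarges the receptive field, and an EfficientDet-style weighted bidirectional fusion node. The class names, channel counts, and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAtrousBlock(nn.Module):
    """Parallel atrous (dilated) 3x3 convolutions, ASPP-style: each branch
    sees a different receptive field, and a 1x1 conv fuses the branches."""

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))


class WeightedFusion(nn.Module):
    """One weighted bidirectional feature pyramid (BiFPN) fusion node:
    learnable non-negative weights, normalized so they sum to one."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(w[i] * x for i, x in enumerate(inputs))


if __name__ == "__main__":
    block = ParallelAtrousBlock(64)
    fuse = WeightedFusion(num_inputs=2)
    shallow = torch.randn(1, 64, 32, 88)  # shallow backbone feature (small objects)
    deep = torch.randn(1, 64, 32, 88)     # deeper feature, already upsampled
    out = fuse([block(shallow), deep])    # cascade: atrous context, then weighted fusion
    print(out.shape)                      # torch.Size([1, 64, 32, 88])
```

The cascaded arrangement here (atrous context first, weighted fusion second) follows the abstract's wording; the actual module presumably wires several such nodes across pyramid levels.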
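The MORD supervision can be sketched in the same spirit. The abstract describes learning the relative depth between an object's internal structures and selected reference points, supervised by radar depth; the function below is a hypothetical rendering of such a loss, where the point sampling, the choice of reference point, and the signature are all assumptions rather than the paper's definition.

```python
import torch
import torch.nn.functional as F


def relative_depth_loss(pred_depth, radar_depth, valid_mask, ref_index=0):
    """Hypothetical relative-depth loss in the spirit of MORD.

    pred_depth:  (N,) predicted depths at N points sampled inside one object
    radar_depth: (N,) radar-measured depths at the same points (sparse truth)
    valid_mask:  (N,) bool, True where a radar return actually exists
    ref_index:   index of the reference point (e.g. the object centre)
    """
    if valid_mask.sum() == 0:
        return pred_depth.new_zeros(())  # no radar evidence for this object
    # Supervise depth *relative* to the reference point rather than absolute
    # depth, so the model learns the object's internal spatial structure.
    pred_rel = pred_depth - pred_depth[ref_index]
    gt_rel = radar_depth - radar_depth[ref_index]
    return F.smooth_l1_loss(pred_rel[valid_mask], gt_rel[valid_mask])


# Example: five sampled points on one object, reference = point 0.
pred = torch.tensor([10.2, 10.8, 11.1, 9.9, 10.5])
radar = torch.tensor([10.0, 11.0, 11.0, 10.0, 10.0])
mask = torch.tensor([True, True, False, True, True])
print(relative_depth_loss(pred, radar, mask, ref_index=0))
```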
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (2023CSJGG1600), the Natural Science Foundation of Anhui Province (2208085MF173), and the Wuhu "ChiZhu Light" Major Science and Technology Project (2023ZD01, 2023ZD03).
Author information
Authors and Affiliations
Contributions
P.S.: SS-BEV: multi-camera BEV object detection based on multi-scale spatial structure understanding. Y.P.: Software, methodology, resources, writing (original draft preparation), visualization, formal analysis, investigation. A.Y.: Supervision, validation.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shi, P., Pan, Y. & Yang, A. SS-BEV: multi-camera BEV object detection based on multi-scale spatial structure understanding. SIViP 19, 172 (2025). https://doi.org/10.1007/s11760-024-03762-5