Abstract
Multi-view 3D object detection systems often struggle to produce precise predictions because estimating depth from images is inherently difficult, which leads to redundant and incorrect detections. This paper presents Ray Denoising, a method that improves detection accuracy by strategically sampling along camera rays to construct hard negative examples. Because these examples are visually difficult to distinguish from true positives, they compel the model to learn depth-aware features, improving its ability to separate true detections from false ones. Ray Denoising is designed as a plug-and-play module compatible with any DETR-style multi-view 3D detector; it adds only a small training-time cost and leaves inference speed unchanged. Comprehensive experiments, including detailed ablation studies, consistently show that Ray Denoising outperforms strong baselines across multiple datasets: it improves mean Average Precision (mAP) by 1.9% over the state-of-the-art StreamPETR method on the nuScenes dataset and achieves significant gains on the Argoverse 2 dataset, demonstrating its generalization capability. The code is available at https://github.com/LiewFeng/RayDN.
F. Liu: This work was done during an internship at Mach Drive.
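To make the core idea concrete, below is a minimal NumPy sketch of sampling hard negatives along a camera ray through a ground-truth object center. The function name ray_hard_negatives and the specific depth-jitter scheme are illustrative assumptions, not the paper's exact formulation; the released code at the URL above is the authoritative reference.

```python
import numpy as np

def ray_hard_negatives(gt_center, cam_origin, num_samples=4, rng=None):
    """Sample `num_samples` 3D points on the camera ray through `gt_center`.

    Every sampled point projects to (nearly) the same image location as the
    true object but sits at a different depth, so it can only be rejected
    by a detector that has learned depth-aware features.
    """
    rng = rng if rng is not None else np.random.default_rng()
    ray = gt_center - cam_origin
    depth = np.linalg.norm(ray)          # ground-truth depth along the ray
    unit = ray / depth                   # unit viewing direction

    # Multiplicative depth jitter kept away from zero so no negative
    # coincides with the true object; the sign places the sample in front
    # of or behind it along the same viewing ray. (Illustrative choice:
    # the paper's actual sampling distribution may differ.)
    mag = rng.uniform(0.05, 0.3, size=num_samples)
    sign = rng.choice([-1.0, 1.0], size=num_samples)
    scales = 1.0 + sign * mag

    return cam_origin + (scales * depth)[:, None] * unit[None, :]

# Example: one ground-truth center, camera at the origin.
negatives = ray_hard_negatives(np.array([2.0, 0.5, 20.0]),
                               np.array([0.0, 0.0, 0.0]))
print(negatives.shape)  # (4, 3): four hard negatives on the same ray
```

In a DETR-style detector, points like these would serve as extra denoising queries supervised as negatives during training, in the spirit of query denoising; they carry no cost at inference time since the denoising branch is dropped.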
References
Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9287–9296 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2009)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Huang, J., Huang, G.: BEVDet4D: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G., Zhu, Z., Ye, Y., Du, D.: BEVDet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Jiang, Y., et al.: PolarFormer: multi-camera 3D object detection with polar transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1042–1050 (2023)
Lee, Y., Hwang, J.W., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13619–13627 (2022)
Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: BEVStereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1486–1494 (2023)
Li, Y., et al.: BEVDepth: acquisition of reliable depth for multi-view 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1477–1485 (2023)
Li, Z., et al.: BEVFormer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13669, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D: multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581 (2022)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D v2: recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018 (2023)
Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: SparseBEV: high-performance sparse 3D object detection from multi-camera videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18580–18590 (2023)
Liu, S., et al.: DAB-DETR: dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329 (2022)
Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13687, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31
Liu, Y., et al.: PETRv2: a unified framework for 3D perception from multi-camera images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3262–3272 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)
Ma, X., et al.: Delving into localization errors for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4721–4730 (2021)
Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3142–3152 (2021)
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. In: The Eleventh International Conference on Learning Representations (2023)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8555–8564 (2021)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
Shu, C., Deng, J., Yu, F., Liu, Y.: 3DPPE: 3D point positional encoding for transformer-based multi-camera 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3580–3589 (2023)
Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1991–1999 (2019)
Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020)
Sung, K.K., Poggio, T.: Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 39–51 (1998)
Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3621–3631 (2023)
Wang, T., Zhu, X., Pang, J., Lin, D.: FCOS3D: fully convolutional one-stage monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 913–922 (2021)
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: query design for transformer-based detector. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2567–2575 (2022)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180–191. PMLR (2022)
Wang, Z., Huang, Z., Fu, J., Wang, N., Liu, S.: Object as query: lifting any 2D object detector to 3D detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3791–3800 (2023)
Wilson, B., et al.: Argoverse 2: next generation datasets for self-driving perception and forecasting. In: Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
Yan, J., et al.: Cross modal transformer: towards fast and robust 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18268–18278 (2023)
Yang, C., et al.: BEVFormer v2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17830–17839 (2023)
Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations (2023)
Zhang, H., et al.: Introducing depth into transformer-based 3D object detection. arXiv preprint arXiv:2302.13002 (2023)
Zhang, Y., Lu, J., Zhou, J.: Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3298 (2021)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Zong, Z., et al.: Temporal enhanced training of multi-view 3D object detector via historical object prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3781–3790 (2023)
Acknowledgment
This work was supported by the Fundamental Research Funds for the Central Universities (E3E41903, E2ET1104, E3ET6201X2) and by the National Natural Science Foundation of China (NSFC) under Grants 62225208 and 62171431.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, F. et al. (2025). Ray Denoising: Depth-Aware Hard Negative Sampling for Multi-view 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15107. Springer, Cham. https://doi.org/10.1007/978-3-031-72967-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72966-9
Online ISBN: 978-3-031-72967-6