Abstract
Multi-view 3D object detection systems often struggle to produce precise predictions because estimating depth from images is inherently difficult, which leads to redundant and incorrect detections. This paper presents Ray Denoising, a method that improves detection accuracy by strategically sampling along camera rays to construct hard negative examples. Because these examples are visually difficult to distinguish from true positives, they compel the model to learn depth-aware features, improving its ability to separate true detections from false ones. Ray Denoising is designed as a plug-and-play module compatible with any DETR-style multi-view 3D detector; it adds only a small training-time cost and leaves inference speed unchanged. Comprehensive experiments, including detailed ablation studies, consistently show that Ray Denoising outperforms strong baselines across multiple datasets: it improves mean Average Precision (mAP) by 1.9% over the state-of-the-art StreamPETR method on the nuScenes dataset and achieves significant gains on the Argoverse 2 dataset, demonstrating its generalization capability. The code is available at https://github.com/LiewFeng/RayDN.
F. Liu: This work was done during an internship at Mach Drive.
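To make the core idea concrete, below is a minimal NumPy sketch of sampling hard negatives along a camera ray through a ground-truth object center. The function name ray_hard_negatives and the specific depth-jitter scheme are illustrative assumptions, not the paper's exact formulation; the released code at the URL above is the authoritative reference.

```python
import numpy as np

def ray_hard_negatives(gt_center, cam_origin, num_samples=4, rng=None):
    """Sample `num_samples` 3D points on the camera ray through `gt_center`.

    Every sampled point projects to (nearly) the same image location as the
    true object but sits at a different depth, so it can only be rejected
    by a detector that has learned depth-aware features.
    """
    rng = rng if rng is not None else np.random.default_rng()
    ray = gt_center - cam_origin
    depth = np.linalg.norm(ray)          # ground-truth depth along the ray
    unit = ray / depth                   # unit viewing direction

    # Multiplicative depth jitter kept away from zero so no negative
    # coincides with the true object; the sign places the sample in front
    # of or behind it along the same viewing ray. (Illustrative choice:
    # the paper's actual sampling distribution may differ.)
    mag = rng.uniform(0.05, 0.3, size=num_samples)
    sign = rng.choice([-1.0, 1.0], size=num_samples)
    scales = 1.0 + sign * mag

    return cam_origin + (scales * depth)[:, None] * unit[None, :]

# Example: one ground-truth center, camera at the origin.
negatives = ray_hard_negatives(np.array([2.0, 0.5, 20.0]),
                               np.array([0.0, 0.0, 0.0]))
print(negatives.shape)  # (4, 3): four hard negatives on the same ray
```

In a DETR-style detector, points like these would serve as extra denoising queries supervised as negatives during training, in the spirit of query denoising; they carry no cost at inference time since the denoising branch is dropped.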
References
Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9287–9296 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2009)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Huang, J., Huang, G.: BEVDet4D: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G., Zhu, Z., Ye, Y., Du, D.: BEVDet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Jiang, Y., et al.: PolarFormer: multi-camera 3D object detection with polar transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1042–1050 (2023)
Lee, Y., Hwang, J.W., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13619–13627 (2022)
Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: BEVStereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1486–1494 (2023)
Li, Y., et al.: BEVDepth: acquisition of reliable depth for multi-view 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1477–1485 (2023)
Li, Z., et al.: BEVFormer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13669, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D: multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581 (2022)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D v2: recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018 (2023)
Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: SparseBEV: high-performance sparse 3D object detection from multi-camera videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18580–18590 (2023)
Liu, S., et al.: DAB-DETR: dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329 (2022)
Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13687, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31
Liu, Y., et al.: PETRv2: a unified framework for 3D perception from multi-camera images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3262–3272 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)
Ma, X., et al.: Delving into localization errors for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4721–4730 (2021)
Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3142–3152 (2021)
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. In: The Eleventh International Conference on Learning Representations (2023)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8555–8564 (2021)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
Shu, C., Deng, J., Yu, F., Liu, Y.: 3DPPE: 3D point positional encoding for transformer-based multi-camera 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3580–3589 (2023)
Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1991–1999 (2019)
Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020)
Sung, K.K., Poggio, T.: Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 39–51 (1998)
Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3621–3631 (2023)
Wang, T., Zhu, X., Pang, J., Lin, D.: FCOS3D: fully convolutional one-stage monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 913–922 (2021)
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: query design for transformer-based detector. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2567–2575 (2022)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180–191. PMLR (2022)
Wang, Z., Huang, Z., Fu, J., Wang, N., Liu, S.: Object as query: lifting any 2D object detector to 3D detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3791–3800 (2023)
Wilson, B., et al.: Argoverse 2: next generation datasets for self-driving perception and forecasting. In: Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
Yan, J., et al.: Cross modal transformer: towards fast and robust 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18268–18278 (2023)
Yang, C., et al.: BEVFormer v2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17830–17839 (2023)
Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations (2023)
Zhang, H., et al.: Introducing depth into transformer-based 3D object detection. arXiv preprint arXiv:2302.13002 (2023)
Zhang, Y., Lu, J., Zhou, J.: Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3298 (2021)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Zong, Z., et al.: Temporal enhanced training of multi-view 3D object detector via historical object prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3781–3790 (2023)
Acknowledgment
This work was supported by the Fundamental Research Funds for the Central Universities (E3E41903, E2ET1104, E3ET6201X2) and by the National Natural Science Foundation of China (NSFC) under Grants 62225208 and 62171431.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, F. et al. (2025). Ray Denoising: Depth-Aware Hard Negative Sampling for Multi-view 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15107. Springer, Cham. https://doi.org/10.1007/978-3-031-72967-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72966-9
Online ISBN: 978-3-031-72967-6