DDet3D: embracing 3D object detector with diffusion

Applied Intelligence

Abstract

Existing 3D object detection approaches rely on heuristic or learnable object proposals, the latter of which must be optimised during training. We replace these hand-crafted or learnable proposals with randomly generated ones, and formulate a new paradigm in which a diffusion model detects 3D objects from a set of randomly generated and supervised learning-based object proposals in an autonomous driving application. We propose DDet3D, a diffusion-based 3D object detection framework that formulates 3D object detection as a generative task over 3D bounding box coordinates. To our knowledge, this work is the first to formulate 3D object detection with a denoising diffusion model and to establish that randomly generated and supervised learning-based 3D proposals (as opposed to empirical anchors or learnt queries) are also viable object candidates for 3D object detection. During training, noisy 3D boxes are obtained from the 3D ground truth boxes by progressively adding Gaussian noise, and the DDet3D network is trained to reverse this diffusion process. During inference, the DDet3D network iteratively refines the randomly generated and supervised learning-based noisy 3D boxes to predict 3D bounding boxes, conditioned on LiDAR Bird’s Eye View (BEV) features. An advantage of DDet3D is that it decouples the training and inference stages, so a larger number of proposal boxes or sampling steps can be used at inference to improve accuracy. We conduct extensive experiments and analysis on the nuScenes and KITTI datasets, where DDet3D achieves competitive performance compared with well-designed 3D object detectors. Our work serves as a strong baseline for exploring and employing more efficient diffusion models for 3D perception tasks.
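The training and inference procedure summarised above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 7-value box encoding (x, y, z, l, w, h, yaw), the linear noise schedule, the deterministic DDIM-style update, and the helper names `linear_alpha_bar`, `add_noise`, `refine`, and `make_toy_predictor` are all assumptions made for illustration. In particular, the toy predictor merely snaps each box to its nearest ground-truth box and stands in for the trained network, which in the paper is conditioned on LiDAR BEV features.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(gt_boxes, t, alpha_bar, rng):
    """Training side: sample x_t from q(x_t | x_0) by adding Gaussian noise."""
    eps = rng.standard_normal(gt_boxes.shape)
    return np.sqrt(alpha_bar[t]) * gt_boxes + np.sqrt(1.0 - alpha_bar[t]) * eps

def refine(noisy_boxes, predict_x0, alpha_bar, num_sampling_steps):
    """Inference side: deterministic DDIM-style refinement of noisy boxes."""
    last = len(alpha_bar) - 1
    timesteps = np.linspace(last, 0, num_sampling_steps).round().astype(int)
    x = noisy_boxes
    for i, t in enumerate(timesteps):
        x0_hat = predict_x0(x, t)          # network's estimate of clean boxes
        if i == len(timesteps) - 1:
            return x0_hat                  # final step: keep the prediction
        t_next = timesteps[i + 1]
        # Recover the implied noise, then re-noise the estimate to level t_next.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        x = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_next]) * eps_hat
    return x

def make_toy_predictor(gt_boxes):
    """Hypothetical stand-in for the trained detector: snap to nearest GT box."""
    def predict_x0(x, t):
        dists = np.linalg.norm(x[:, None, :] - gt_boxes[None, :, :], axis=-1)
        return gt_boxes[dists.argmin(axis=1)]
    return predict_x0

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
gt = rng.uniform(-1.0, 1.0, size=(5, 7))             # 5 normalised GT boxes

x_t = add_noise(gt, t=500, alpha_bar=alpha_bar, rng=rng)   # training input

# At inference, the proposal count (20) and step count (4) are free choices.
proposals = rng.standard_normal((20, 7))
boxes = refine(proposals, make_toy_predictor(gt), alpha_bar, num_sampling_steps=4)
print(boxes.shape)
```

Because `refine` takes the proposals and the number of sampling steps as arguments rather than fixing them at training time, the sketch mirrors the decoupling described in the abstract: either quantity can be increased at inference without retraining.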

Data Availability

The large-scale, publicly available autonomous driving datasets used in this work are available at https://www.nuscenes.org/ (nuScenes) and https://www.cvlibs.net/datasets/kitti/index.php (KITTI).

Abbreviations

BEV: Bird’s Eye View
CBGS: Class-balanced Grouping and Sampling
DDIM: Denoising Diffusion Implicit Models
DDPM: Denoising Diffusion Probabilistic Models
FPN: Feature Pyramid Network
FPS: Frames Per Second
MLP: Multilayer Perceptron
NDS: nuScenes Detection Score
NMS: Non-maximum Suppression
R-CNN: Region Convolutional Neural Network
RPN: Region Proposal Network
RoI: Region of Interest
SECOND: Sparsely Embedded Convolutional Detection
SNR: Signal-to-Noise Ratio
SparseConv: Sparse Convolution
mAAE: mean Average Attribute Error
mAOE: mean Average Orientation Error
mAP: mean Average Precision
mASE: mean Average Scale Error
mATE: mean Average Translation Error
mAVE: mean Average Velocity Error

Acknowledgements

This work has been supported by the European Union’s H2020 MSCA-ITN-ACHIEVE with grant agreement No. 765866, Fundação para a Ciência e a Tecnologia (FCT) under the project UIDB/00048/2020 (https://doi.org/10.54499/UIDB/00048/2020), and FCT Portugal PhD research grant with reference 2021.06219.BD.

Author information

Contributions

Gopi Krishna Erabati: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft. Helder Araujo: Supervision, Resources, Funding acquisition, Writing - Review & Editing.

Corresponding author

Correspondence to Gopi Krishna Erabati.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest.

About this article

Cite this article

Erabati, G.K., Araujo, H. DDet3D: embracing 3D object detector with diffusion. Appl Intell 55, 283 (2025). https://doi.org/10.1007/s10489-024-06045-1

  • DOI: https://doi.org/10.1007/s10489-024-06045-1
