DDet3D: embracing 3D object detector with diffusion

Erabati, Gopi Krishna; Araujo, Helder

doi:10.1007/s10489-024-06045-1

Published: 09 January 2025

Volume 55, article number 283, (2025)

Applied Intelligence

Gopi Krishna Erabati¹ & Helder Araujo¹

Abstract

Existing approaches to 3D object detection rely on heuristic or learnable object proposals, which must be optimised during training. We replace these hand-crafted or learnable proposals with randomly generated ones, formulating a new paradigm in which a diffusion model detects 3D objects from a set of randomly generated and supervised learning-based object proposals in an autonomous driving application. We propose DDet3D, a diffusion-based 3D object detection framework that formulates 3D object detection as a generative task over the 3D bounding box coordinates. To our knowledge, this work is the first to formulate 3D object detection with a denoising diffusion model and to establish that randomly generated and supervised learning-based 3D proposals (as opposed to empirical anchors or learnt queries) are also viable object candidates for 3D object detection. During training, noisy 3D boxes are obtained from the 3D ground-truth boxes by progressively adding Gaussian noise, and the DDet3D network is trained to reverse this diffusion process. During inference, the DDet3D network iteratively refines the randomly generated and supervised learning-based noisy 3D boxes into 3D bounding box predictions, conditioned on LiDAR Bird’s Eye View (BEV) features. An advantage of DDet3D is that it decouples the training and inference stages, so a larger number of proposal boxes or sampling steps can be used during inference to improve accuracy. We conduct extensive experiments and analysis on the nuScenes and KITTI datasets. DDet3D achieves competitive performance compared to well-designed 3D object detectors. Our work serves as a strong baseline for exploring and employing more efficient diffusion models for 3D perception tasks.
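The training-time corruption described in the abstract (Gaussian noise added progressively to ground-truth boxes) can be illustrated with a minimal sketch of the standard forward diffusion step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is not the authors' implementation. The abstract does not state the noise schedule, the number of steps, or the box encoding, so the cosine schedule, the 1000-step horizon, the 7-value box layout (cx, cy, cz, w, l, h, yaw), and the names `cosine_alpha_bar` and `noise_boxes` are all assumptions made for illustration.

```python
import numpy as np


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar(t) of a cosine noise schedule.

    Assumption: the abstract does not name a schedule; the cosine form
    is used here only as a common choice.
    """
    def f(u):
        return np.cos((u / num_steps + s) / (1 + s) * np.pi / 2) ** 2

    return f(t) / f(0)


def noise_boxes(gt_boxes, t, num_steps=1000, rng=None):
    """Sample x_t ~ q(x_t | x_0) for an (N, 7) array of ground-truth boxes.

    Assumption: each row is (cx, cy, cz, w, l, h, yaw) in normalised units;
    this layout is hypothetical and chosen only for the example.
    """
    rng = np.random.default_rng() if rng is None else rng
    a_bar = cosine_alpha_bar(t, num_steps)
    eps = rng.standard_normal(gt_boxes.shape)  # Gaussian noise, same shape as boxes
    return np.sqrt(a_bar) * gt_boxes + np.sqrt(1.0 - a_bar) * eps


if __name__ == "__main__":
    gt = np.array([[0.1, -0.2, 0.0, 0.4, 0.2, 0.15, 0.3]])
    rng = np.random.default_rng(0)
    for step in (0, 250, 500, 1000):
        print(step, noise_boxes(gt, step, rng=rng).round(3))
```

At step 0 the boxes are returned unchanged, and at the final step they are almost pure Gaussian noise. That end state is what makes it plausible to start inference from randomly generated boxes and denoise them step by step, as the abstract describes.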

Data Availability

The large-scale, publicly available autonomous driving datasets used in this work are available at https://www.nuscenes.org/ (nuScenes) and https://www.cvlibs.net/datasets/kitti/index.php (KITTI).

Abbreviations

BEV: Bird’s Eye View
CBGS: Class-balanced Grouping and Sampling
DDIM: Denoising Diffusion Implicit Models
DDPM: Denoising Diffusion Probabilistic Models
FPN: Feature Pyramid Network
FPS: Frames Per Second
MLP: Multilayer Perceptron
NDS: nuScenes Detection Score
NMS: Non-maximum Suppression
R-CNN: Region Convolutional Neural Network
RPN: Region Proposal Network
RoI: Region of Interest
SECOND: Sparsely Embedded Convolutional Detection
SNR: Signal-to-Noise Ratio
SparseConv: Sparse Convolution
mAAE: mean Average Attribute Error
mAOE: mean Average Orientation Error
mAP: mean Average Precision
mASE: mean Average Scale Error
mATE: mean Average Translation Error
mAVE: mean Average Velocity Error

Acknowledgements

This work has been supported by the European Union’s H2020 MSCA-ITN-ACHIEVE with grant agreement No. 765866, Fundação para a Ciência e a Tecnologia (FCT) under the project UIDB/00048/2020 (https://doi.org/10.54499/UIDB/00048/2020), and FCT Portugal PhD research grant with reference 2021.06219.BD.

Author information

Authors and Affiliations

Institute of Systems and Robotics, University of Coimbra, Rua Silvio Lima - Polo II, 3030-290, Coimbra, Portugal
Gopi Krishna Erabati & Helder Araujo

Contributions

Gopi Krishna Erabati: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft. Helder Araujo: Supervision, Resources, Funding acquisition, Writing - Review & Editing.

Corresponding author

Correspondence to Gopi Krishna Erabati.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Erabati, G.K., Araujo, H. DDet3D: embracing 3D object detector with diffusion. Appl Intell 55, 283 (2025). https://doi.org/10.1007/s10489-024-06045-1

Accepted: 08 September 2024
Published: 09 January 2025
DOI: https://doi.org/10.1007/s10489-024-06045-1
