Abstract
Existing approaches to 3D object detection rely on heuristic or learnable object proposals, the latter of which must be optimised during training. We replace these hand-crafted or learnable proposals with randomly generated ones by formulating a new paradigm that employs a diffusion model to detect 3D objects from a set of randomly generated and supervised learning-based object proposals in an autonomous driving application. We propose DDet3D, a diffusion-based 3D object detection framework that formulates 3D object detection as a generative task over 3D bounding box coordinates. To our knowledge, this work is the first to formulate 3D object detection with a denoising diffusion model and to establish that randomly generated and supervised learning-based 3D proposals (as opposed to empirical anchors or learnt queries) are also viable object candidates for 3D object detection. During training, noisy 3D boxes are obtained from the 3D ground truth boxes by progressively adding Gaussian noise, and the DDet3D network is trained to reverse this diffusion process. During inference, the DDet3D network iteratively refines the randomly generated and supervised learning-based noisy 3D boxes to predict 3D bounding boxes conditioned on LiDAR Bird’s Eye View (BEV) features. An advantage of DDet3D is that it decouples the training and inference stages, so a larger number of proposal boxes or sampling steps can be used during inference to improve accuracy. We conduct extensive experiments and analysis on the nuScenes and KITTI datasets, where DDet3D achieves competitive performance compared to well-designed 3D object detectors. Our work serves as a strong baseline for exploring and employing more efficient diffusion models in 3D perception tasks.
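The training and inference procedures summarised in the abstract can be illustrated with a minimal sketch. It is not the DDet3D implementation: the cosine noise schedule, the seven-value box encoding (centre, size, yaw, scaled to a small range), the function names, and the deterministic DDIM-style update are all assumptions made for illustration, and the detection head is replaced by an oracle that returns the ground truth box, so that the sketch runs without a network or BEV features.

```python
import math
import random

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t of a cosine schedule (assumed, not from the paper)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def add_noise(box, t, num_steps, rng):
    """Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for v in box]

def ddim_step(x_t, x0_pred, t, t_prev, num_steps):
    """Deterministic DDIM-style update from step t to t_prev, given a predicted clean box."""
    a_t = cosine_alpha_bar(t, num_steps)
    a_prev = cosine_alpha_bar(t_prev, num_steps)
    out = []
    for xt, x0 in zip(x_t, x0_pred):
        eps = (xt - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t)  # noise implied by x0_pred
        out.append(math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * eps)
    return out

rng = random.Random(0)
T = 1000
# Hypothetical normalised box: (cx, cy, cz, w, l, h, yaw).
gt_box = [0.2, -0.1, 0.0, 0.05, 0.1, 0.04, 0.3]

# Training side: corrupt the ground truth box at a chosen timestep.
noisy_box = add_noise(gt_box, 500, T, rng)

# Inference side: start from pure Gaussian noise and refine over a few steps.
x = [rng.gauss(0.0, 1.0) for _ in gt_box]
steps = [1000, 500, 0]
for t, t_prev in zip(steps[:-1], steps[1:]):
    x0_pred = gt_box  # oracle stand-in for the network's prediction (assumption)
    x = ddim_step(x, x0_pred, t, t_prev, T)
```

Because the number of random boxes and the entries of `steps` are chosen only at inference time in this sketch, they can be changed without retraining, which is the decoupling the abstract refers to.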
Data Availability
The large-scale, publicly available autonomous driving datasets used in this work are available at https://www.nuscenes.org/ (nuScenes) and https://www.cvlibs.net/datasets/kitti/index.php (KITTI).
Abbreviations
- BEV: Bird’s Eye View
- CBGS: Class-balanced Grouping and Sampling
- DDIM: Denoising Diffusion Implicit Models
- DDPM: Denoising Diffusion Probabilistic Models
- FPN: Feature Pyramid Network
- FPS: Frames Per Second
- MLP: Multilayer Perceptron
- NDS: nuScenes Detection Score
- NMS: Non-maximum Suppression
- R-CNN: Region Convolutional Neural Network
- RPN: Region Proposal Network
- RoI: Region of Interest
- SECOND: Sparsely Embedded Convolutional Detection
- SNR: Signal-to-Noise Ratio
- SparseConv: Sparse Convolution
- mAAE: mean Average Attribute Error
- mAOE: mean Average Orientation Error
- mAP: mean Average Precision
- mASE: mean Average Scale Error
- mATE: mean Average Translation Error
- mAVE: mean Average Velocity Error
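For orientation, the nuScenes benchmark combines several of the metrics listed above into NDS: mAP receives half of the weight, and each of the five mean true-positive errors (mATE, mASE, mAOE, mAVE, mAAE) is clipped to 1 and converted into a score. The sketch below follows that public definition. The function name and the input values are hypothetical and are not results from this article.

```python
def nuscenes_detection_score(m_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err) for each mean TP error)) / 10."""
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# Hypothetical values, for illustration only.
example_errors = {
    "mATE": 0.30,
    "mASE": 0.25,
    "mAOE": 0.40,
    "mAVE": 0.30,
    "mAAE": 0.20,
}
score = nuscenes_detection_score(0.60, example_errors)
```

Clipping matters mainly for errors that can exceed 1, such as mAOE (in radians) and mAVE (in m/s), so that a single poorly estimated attribute cannot push the score below the mAP contribution.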
Acknowledgements
This work has been supported by the European Union’s H2020 MSCA-ITN-ACHIEVE with grant agreement No. 765866, Fundação para a Ciência e a Tecnologia (FCT) under the project UIDB/00048/2020 (https://doi.org/10.54499/UIDB/00048/2020), and FCT Portugal PhD research grant with reference 2021.06219.BD.
Author information
Contributions
Gopi Krishna Erabati: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft. Helder Araujo: Supervision, Resources, Funding acquisition, Writing - Review & Editing.
Ethics declarations
Conflict of Interest
The authors have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Erabati, G.K., Araujo, H. DDet3D: embracing 3D object detector with diffusion. Appl Intell 55, 283 (2025). https://doi.org/10.1007/s10489-024-06045-1
DOI: https://doi.org/10.1007/s10489-024-06045-1