DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

Peng, Liang; Wu, Xiaopei; Yang, Zheng; Liu, Haifeng; Cai, Deng

doi:10.1007/978-3-031-19769-7_5


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)

Included in the following conference series:

European Conference on Computer Vision


Abstract

Monocular 3D detection has drawn much attention from the community due to its low cost and simple setup. It takes an RGB image as input and predicts 3D boxes in 3D space. The most challenging sub-task is instance depth estimation. Previous works usually estimate it directly. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It couples visual depth clues with instance attribute clues, which makes it hard for a network to learn directly. Therefore, we propose to reformulate the instance depth as the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects’ appearances and positions on the image. By contrast, the attribute depth relies on objects’ inherent attributes, which are invariant to affine transformations of the object on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining the different types of depth and their associated uncertainties, we obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited due to its physical nature, which hinders performance gains. The proposed instance depth disentanglement strategy alleviates this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component of our method. The code is released at https://github.com/SPengLiang/DID-M3D.
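The decoupling described in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so everything below is an assumption made for illustration: that the instance depth is the sum of the visual and attribute depths, that the two uncertainties are independent and add in quadrature, and that per-cell estimates are averaged with weights exp(-uncertainty). The function name combine_instance_depth and its arguments are hypothetical and are not taken from the released code.

```python
import math


def combine_instance_depth(visual_depth, attribute_depth, visual_unc, attribute_unc):
    """Fuse per-cell depth estimates for one object into a single instance depth.

    Each argument is a list of floats with one entry per grid cell on the object.
    Returns (instance_depth, per_cell_depth, per_cell_uncertainty).
    """
    n = len(visual_depth)
    if n == 0:
        raise ValueError("at least one cell is required")
    if not (len(attribute_depth) == len(visual_unc) == len(attribute_unc) == n):
        raise ValueError("all inputs must have the same length")

    # Assumption: per-cell instance depth = visual depth + attribute depth.
    cell_depth = [v + a for v, a in zip(visual_depth, attribute_depth)]

    # Assumption: the two uncertainties are independent, so they add in quadrature.
    cell_unc = [math.hypot(uv, ua) for uv, ua in zip(visual_unc, attribute_unc)]

    # Assumption: confidence decays exponentially with uncertainty.
    weights = [math.exp(-u) for u in cell_unc]
    total = sum(weights)

    # Confidence-weighted average over the cells gives the final instance depth.
    instance_depth = sum(w * d for w, d in zip(weights, cell_depth)) / total
    return instance_depth, cell_depth, cell_unc


if __name__ == "__main__":
    depth, per_cell, unc = combine_instance_depth(
        visual_depth=[20.0, 21.0],
        attribute_depth=[1.0, 1.0],
        visual_unc=[0.3, 0.3],
        attribute_unc=[0.4, 0.4],
    )
    print(depth, per_cell, unc)
```

Under these assumptions, cells with equal uncertainty contribute equally, while a cell with a larger combined uncertainty is down-weighted in the final depth.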

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2018AAA0101400), in part by the National Natural Science Foundation of China (Grant Nos. 62036009, U1909203, 61936006, 61973271), and in part by the Innovation Capability Support Program of Shaanxi (Program No. 2021TD-05).

Author information

Authors and Affiliations

State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
Liang Peng, Xiaopei Wu, Haifeng Liu & Deng Cai
Fabu Inc., Hangzhou, China
Liang Peng, Zheng Yang & Deng Cai

Corresponding author

Correspondence to Deng Cai.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Electronic supplementary material

Supplementary material 1 (PDF, 16220 KB)

About this paper

Cite this paper

Peng, L., Wu, X., Yang, Z., Liu, H., Cai, D. (2022). DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_5

DOI: https://doi.org/10.1007/978-3-031-19769-7_5
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7
eBook Packages: Computer Science, Computer Science (R0)
