Skip to main content

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13661))

Included in the following conference series:

Abstract

Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity. It takes an RGB image as input and predicts 3D boxes in the 3D space. The most challenging sub-task lies in the instance depth estimation. Previous works usually use a direct estimation method. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It is coupled by visual depth clues and instance attribute clues, making it hard to be directly learned in the network. Therefore, we propose to reformulate the instance depth to the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects’ appearances and positions on the image. By contrast, the attribute depth relies on objects’ inherent attributes, which are invariant to the object affine transformation on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining different types of depths and associated uncertainties, we can obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited due to the physical nature, hindering the boost of performance. Based on the proposed instance depth disentanglement strategy, we can alleviate this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component in our method. The codes are released at https://github.com/SPengLiang/DID-M3D.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3D object detection. arXiv preprint arXiv:2005.09927 (2020)

  2. Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9287–9296 (2019)

    Google Scholar 

  3. Brazil, G., Pons-Moll, G., Liu, X., Schiele, B.: Kinematic 3D object detection in monocular video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 135–152. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_9

    Chapter  Google Scholar 

  4. Chai, Y., et al.: To the point: efficient 3D object detection in the range image with graph convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2021)

    Google Scholar 

  5. Chen, H., Huang, Y., Tian, W., Gao, Z., Xiong, L.: Monorun: monocular 3D object detection by reconstruction and uncertainty propagation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10379–10388 (2021)

    Google Scholar 

  6. Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3D object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2156 (2016)

    Google Scholar 

  7. Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R.: 3D object proposals using stereo imagery for accurate object class detection. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1259–1272 (2017)

    Article  Google Scholar 

  8. Chen, Y., Liu, S., Shen, X., Jia, J.: Fast point R-CNN. In: ICCV (2019)

    Google Scholar 

  9. Chen, Y., Tai, L., Sun, K., Li, M.: Monopair: monocular 3D object detection using pairwise spatial relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12093–12102 (2020)

    Google Scholar 

  10. Chu, X., et al.: Neighbor-vote: improving monocular 3D object detection through neighbor distance voting. arXiv preprint arXiv:2107.02493 (2021)

  11. Dijk, T.V., Croon, G.D.: How do neural networks see depth in single images? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2183–2191 (2019)

    Google Scholar 

  12. Ding, M., Huo, Y., Yi, H., Wang, Z., Shi, J., Lu, Z., Luo, P.: Learning depth-guided convolutions for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11672–11681 (2020)

    Google Scholar 

  13. Fan, L., Xiong, X., Wang, F., Wang, N., Zhang, Z.: Rangedet: in defense of range view for lidar-based 3D object detection. arXiv preprint arXiv:2103.10039 (2021)

  14. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)

    Google Scholar 

  15. Hu, M., Wang, S., Li, B., Ning, S., Fan, L., Gong, X.: Penet: towards precise and efficient image guided depth completion. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13656–13662. IEEE (2021)

    Google Scholar 

  16. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 30 (2017)

    Google Scholar 

  17. Ku, J., Pon, A.D., Waslander, S.L.: Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11867–11876 (2019)

    Google Scholar 

  18. Kumar, A., Brazil, G., Liu, X.: GrooMed-NMS: grouped mathematically differentiable NMS for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8973–8983 (2021)

    Google Scholar 

  19. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3D lidar using fully convolutional network. arXiv preprint arXiv:1608.07916 (2016)

  20. Li, P., Zhao, H.: Monocular 3D detection with geometric constraint embedding and semi-supervised training. IEEE Robot. Autom. Lett. 6(3), 5565–5572 (2021)

    Article  Google Scholar 

  21. Li, P., Zhao, H., Liu, P., Cao, F.: Rtm3d: real-time monocular 3D detection from object keypoints for autonomous driving. arXiv preprint arXiv:2001.03343 (2020)

  22. Liu, X., Xue, N., Wu, T.: Learning auxiliary monocular contexts helps monocular 3D object detection. arXiv preprint arXiv:2112.04628 (2021)

  23. Liu, Y., Yixuan, Y., Liu, M.: Ground-aware monocular 3D object detection for autonomous driving. IEEE Robot. Autom. Lett. 6(2), 919–926 (2021)

    Article  Google Scholar 

  24. Liu, Z., Zhou, D., Lu, F., Fang, J., Zhang, L.: Autoshape: real-time shape-aware monocular 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15641–15650 (2021)

    Google Scholar 

  25. Lu, Y., et al.: Geometry uncertainty projection network for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3111–3121 (2021)

    Google Scholar 

  26. Ma, X., Liu, S., Xia, Z., Zhang, H., Zeng, X., Ouyang, W.: Rethinking pseudo-lidar representation. arXiv preprint arXiv:2008.04582 (2020)

  27. Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X.: Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6851–6860 (2019)

    Google Scholar 

  28. Ma, X., et al.: Delving into localization errors for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4721–4730 (2021)

    Google Scholar 

  29. Manhardt, F., Kehl, W., Gaidon, A.: Roi-10d: monocular lifting of 2D detection to 6D pose and metric shape. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2069–2078 (2019)

    Google Scholar 

  30. Mao, J., et al.: Voxel transformer for 3D object detection. arXiv preprint arXiv:2109.02497 (2021)

  31. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082 (2017)

    Google Scholar 

  32. Noh, J., Lee, S., Ham, B.: HVPR: hybrid voxel-point representation for single-stage 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14605–14614 (2021)

    Google Scholar 

  33. Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3142–3152 (2021)

    Google Scholar 

  34. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)

    Google Scholar 

  35. Peng, L., Liu, F., Yan, S., He, X., Cai, D.: OCM3D: object-centric monocular 3D object detection. arXiv preprint arXiv:2104.06041 (2021)

  36. Peng, L., et al.: Lidar point cloud guided monocular 3D object detection. arXiv preprint arXiv:2104.09035 (2021)

  37. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)

    Google Scholar 

  38. Qin, Z., Wang, J., Lu, Y.: MonogrNet: a geometric reasoning network for monocular 3D object localization. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8851–8858 (2019)

    Google Scholar 

  39. Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8555–8564 (2021)

    Google Scholar 

  40. Roddick, T., Kendall, A., Cipolla, R.: Orthographic feature transform for monocular 3D object detection. arXiv preprint arXiv:1811.08188 (2018)

  41. Shi, S., et al.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: CVPR, pp. 10529–10538 (2020)

    Google Scholar 

  42. Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: CVPR, pp. 770–779 (2019)

    Google Scholar 

  43. Shi, S., Wang, Z., Shi, J., Wang, X., Li, H.: From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 43(8), 2647–2664 (2020)

    Google Scholar 

  44. Shi, X., Ye, Q., Chen, X., Chen, C., Chen, Z., Kim, T.K.: Geometry-based distance decomposition for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15172–15181 (2021)

    Google Scholar 

  45. Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1991–1999 (2019)

    Google Scholar 

  46. Wang, L., et al.: Depth-conditioned dynamic message propagation for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 454–463 (2021)

    Google Scholar 

  47. Wang, L., et al.: Progressive coordinate transforms for monocular 3D object detection. Adv. Neural. Inf. Process. Syst. 34, 13364–13377 (2021)

    Google Scholar 

  48. Wang, T., Xinge, Z., Pang, J., Lin, D.: Probabilistic and geometric depth: detecting objects in perspective. In: Conference on Robot Learning, pp. 1475–1485. PMLR (2022)

    Google Scholar 

  49. Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudo-lidar from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8445–8453 (2019)

    Google Scholar 

  50. Yan, Y., Mao, Y., Li, B.: SECOND: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)

    Article  Google Scholar 

  51. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3DSSD: point-based 3D single stage object detector. In: CVPR, pp. 11040–11048 (2020)

    Google Scholar 

  52. Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: sparse-to-dense 3D object detector for point cloud. In: ICCV, pp. 1951–1960 (2019)

    Google Scholar 

  53. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11784–11793 (2021)

    Google Scholar 

  54. Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412 (2018)

    Google Scholar 

  55. Zhang, Y., Lu, J., Zhou, J.: Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3298 (2021)

    Google Scholar 

  56. Zheng, W., Tang, W., Jiang, L., Fu, C.W.: SE-SSD: self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14494–14503 (2021)

    Google Scholar 

  57. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

  58. Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499 (2018)

    Google Scholar 

  59. Zhou, Y., He, Y., Zhu, H., Wang, C., Li, H., Jiang, Q.: Monocular 3D object detection: an extrinsic parameter free approach. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7556–7566 (2021)

    Google Scholar 

  60. Zou, Z., Ye, X., Du, L., Cheng, X., Tan, X., Zhang, L., Feng, J., Xue, X., Ding, E.: The devil is in the task: exploiting reciprocal appearance-localization features for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2713–2722 (2021)

    Google Scholar 

Download references

Acknowledgments

This work was supported in part by The National Key Research and Development Program of China (Grant Nos: 2018AAA0101400), in part by The National Nature Science Foundation of China (Grant Nos: 62036009, U1909203, 61936006, 61973271), in part by Innovation Capability Support Program of Shaanxi (Program No. 2021TD-05).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Deng Cai .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16220 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Peng, L., Wu, X., Yang, Z., Liu, H., Cai, D. (2022). DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19769-7_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19768-0

  • Online ISBN: 978-3-031-19769-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics