
DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)


Abstract

Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames. However, such methods neither fully exploit 3D point-wise geometric correspondences nor effectively tackle the ambiguities in photometric warping caused by occlusions or illumination inconsistency. To address these problems, this work proposes the Density Volume Construction Network (DevNet), a novel self-supervised monocular depth learning framework that considers 3D spatial information and exploits stronger geometric constraints among adjacent camera frustums. Instead of directly regressing a per-pixel depth value from a single image, DevNet divides the camera frustum into multiple parallel planes and predicts the point-wise occlusion probability density on each plane. The final depth map is generated by integrating the density along the corresponding rays. During training, novel regularization strategies and loss functions are introduced to mitigate photometric ambiguities and overfitting. Without noticeably increasing model size or running time, DevNet outperforms several representative baselines on both the KITTI-2015 outdoor dataset and the NYU-V2 indoor dataset. In particular, DevNet reduces the root-mean-square deviation by around 4% on both KITTI-2015 and NYU-V2 for depth estimation.
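The density-to-depth step the abstract describes follows the standard volume-rendering integral: per-plane densities are converted into interval opacities, accumulated transmittance assigns each plane a rendering weight, and the expected depth is the weight-averaged plane depth. The following is a minimal PyTorch sketch of that integral, not the authors' implementation; the function name, tensor shapes, and uniform plane spacing are illustrative assumptions, and the paper's actual plane placement, density parameterization, and losses may differ.

    # Minimal sketch (assumed shapes, not the authors' code): render a depth
    # map by integrating per-plane densities along each camera ray.
    import torch

    def render_depth_from_density(density, plane_depths):
        """Integrate occlusion probability density along each camera ray.

        density:      (B, N, H, W) non-negative densities predicted on N
                      parallel planes partitioning the camera frustum.
        plane_depths: (N,) plane depths, sorted ascending.
        Returns:      (B, H, W) expected depth per pixel.
        """
        # Interval length per plane; repeat the last gap so every plane
        # has a delta.
        deltas = plane_depths[1:] - plane_depths[:-1]          # (N-1,)
        deltas = torch.cat([deltas, deltas[-1:]])              # (N,)
        deltas = deltas.view(1, -1, 1, 1)                      # broadcastable

        # Interval opacity: alpha_i = 1 - exp(-sigma_i * delta_i).
        alpha = 1.0 - torch.exp(-density * deltas)             # (B, N, H, W)

        # Transmittance T_i = prod_{j<i}(1 - alpha_j): probability that a
        # ray reaches plane i unoccluded (exclusive cumulative product).
        trans = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
        trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)

        # Rendering weights, then expected depth along each ray.
        weights = trans * alpha                                # (B, N, H, W)
        depth = (weights * plane_depths.view(1, -1, 1, 1)).sum(dim=1)
        return depth

    # Usage example: 64 uniformly spaced planes in [0.1, 100] m, with random
    # densities standing in for the network output.
    if __name__ == "__main__":
        planes = torch.linspace(0.1, 100.0, 64)
        sigma = torch.rand(2, 64, 192, 640)
        print(render_depth_from_density(sigma, planes).shape)  # (2, 192, 640)

Taking the expectation over the rendering weights, rather than an argmax over planes, keeps the depth differentiable with respect to the predicted densities, which is what allows photometric and geometric losses to train the density volume end to end.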



Acknowledgements

This work was partially supported by the Huawei UK AI Fellowship. We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks), and the Ascend AI Processor used for this research.

Author information


Corresponding author

Correspondence to Lanqing Hong.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, K. et al. (2022). DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_8


  • DOI: https://doi.org/10.1007/978-3-031-19842-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
