
Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)

Abstract

In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task, motivated by the fact that panoramic 3D cameras often produce 360° depth maps with missing data in complex scenes. The goal of PDC is to recover dense panoramic depth from a raw sparse depth map and the corresponding panoramic RGB image. To tackle this task, we train a deep network that takes both depth and image as inputs for dense panoramic depth recovery. However, training such a network involves a challenging optimization problem over the network parameters due to its non-convex objective function. To address this problem, we propose a simple yet effective approach termed M³PT: multi-modal masked pre-training. Specifically, during pre-training we simultaneously cover up patches of the panoramic RGB image and the sparse depth map with a shared random mask, and then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first work to demonstrate the effectiveness of masked pre-training on a multi-modal vision task, rather than the single-modal setting addressed by masked autoencoders (MAE). Unlike MAE, where fine-tuning completely discards the decoder used in pre-training, M³PT has no architectural difference between the pre-training and fine-tuning stages; the two stages differ only in prediction density, which potentially makes transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M³PT on three panoramic datasets. Notably, on these three benchmark datasets we improve the state-of-the-art baselines by an average of 29.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog.
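As a rough illustration of the shared-mask pre-training described above, the sketch below shows one possible M³PT-style pre-training step in PyTorch. The 16×16 patch size, 75% mask ratio, L1 reconstruction loss, and the model(rgb, depth) interface are assumptions made for this example only, not details taken from the paper or its released code.

    # Illustrative sketch only: shapes, patch size, loss, and the model(rgb, depth)
    # interface are assumptions for this example, not the authors' implementation.
    import torch
    import torch.nn.functional as F

    def shared_random_mask(b, h, w, patch=16, ratio=0.75, device="cpu"):
        """Sample one random patch mask per sample; 1 = visible, 0 = masked."""
        gh, gw = h // patch, w // patch
        keep = torch.rand(b, gh, gw, device=device) > ratio
        mask = keep.repeat_interleave(patch, dim=1).repeat_interleave(patch, dim=2)
        return mask.unsqueeze(1).float()  # (B, 1, H, W)

    def m3pt_pretrain_step(model, rgb, sparse_depth, optimizer, mask_ratio=0.75):
        """One pre-training step: mask RGB and sparse depth with the SAME random
        mask, then regress the sparse depth inside the masked-out regions."""
        b, _, h, w = rgb.shape
        mask = shared_random_mask(b, h, w, ratio=mask_ratio, device=rgb.device)
        pred = model(rgb * mask, sparse_depth * mask)  # same network reused at fine-tuning
        target = (sparse_depth > 0) & (mask == 0)      # valid depth pixels that were masked
        if target.any():
            loss = F.l1_loss(pred[target], sparse_depth[target])
        else:
            loss = pred.sum() * 0.0                    # guard: no valid masked pixels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Because the same network consumes the masked inputs here and the raw sparse inputs at fine-tuning time, no weights need to be discarded when switching between the two stages.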

Z. Yan and X. Li contributed equally to this work.


Notes

  1. https://matterport.com/cameras/pro2-3D-camera.

  2. https://www.faro.com/en/Products/Hardware/Focus-Laser-Scanners.

  3. https://github.com/sunset1995/py360convert.

  4. https://vcl3d.github.io/Pano3D/download/.

  5. http://buildingparser.stanford.edu/dataset.html.

  6. https://vcl3d.github.io/3D60/.


Acknowledgement

The authors would like to thank the reviewers for their detailed comments and instructive suggestions. This work was supported by the National Science Fund of China under Grant Nos. U1713208 and 62072242, and by the Postdoctoral Innovative Talent Support Program of China under Grants BX20200168 and 2020M681608. The PCA Lab is associated with the Key Lab of Intelligent Perception and Systems for High-Dimensional Information of the Ministry of Education and the Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology.

Author information


Correspondence to Jun Li or Jian Yang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6536 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yan, Z., Li, X., Wang, K., Zhang, Z., Li, J., Yang, J. (2022). Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_22


  • DOI: https://doi.org/10.1007/978-3-031-19769-7_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19768-0

  • Online ISBN: 978-3-031-19769-7

  • eBook Packages: Computer Science, Computer Science (R0)
