Abstract
In this paper, we formulate the potentially valuable panoramic depth completion (PDC) task, motivated by the fact that panoramic 3D cameras often produce 360\(^\circ \) depth maps with missing data in complex scenes. The goal of PDC is to recover a dense panoramic depth map from a raw sparse one and the corresponding panoramic RGB image. To deal with the PDC task, we train a deep network that takes both the depth and the image as inputs for dense panoramic depth recovery. However, optimizing the network parameters is challenging because the objective function is non-convex. To address this problem, we propose a simple yet effective approach termed M\(^{3}\)PT: multi-modal masked pre-training. Specifically, during pre-training we simultaneously cover up patches of the panoramic RGB image and the sparse depth with a shared random mask, and then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first work to show the effectiveness of masked pre-training in a multi-modal vision task, rather than the single-modal setting addressed by masked autoencoders (MAE). Unlike MAE, where fine-tuning completely discards the decoder used in pre-training, M\(^{3}\)PT has no architectural difference between the pre-training and fine-tuning stages; the two stages differ only in the density of the predictions, which potentially makes the transfer learning more convenient and effective. Extensive experiments on three panoramic datasets verify the effectiveness of M\(^{3}\)PT. Notably, we improve the state-of-the-art baselines by an average of 29.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog on the three benchmark datasets.
Z. Yan and X. Li—Equal contribution.
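To make the shared-mask scheme concrete, below is a minimal PyTorch sketch of one plausible pre-training step. The patch size, mask ratio, L1 reconstruction loss, and the `model`/`optimizer` interfaces are all illustrative assumptions, not the paper's exact configuration; the only parts taken from the abstract are that one random patch mask is shared across both modalities and that the sparse depth is reconstructed in the masked regions.

```python
import torch

def shared_random_mask(b, h, w, patch=16, ratio=0.5, device="cpu"):
    """One patch-level binary mask per sample (1 = keep, 0 = masked),
    shared across the RGB and sparse-depth modalities."""
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=device) > ratio).float()
    # Upsample the patch grid back to pixel resolution.
    return keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)

def m3pt_pretrain_step(model, rgb, sparse_depth, optimizer):
    """One pre-training step: mask both inputs with the same random mask,
    then regress the sparse depth inside the masked regions."""
    b, _, h, w = rgb.shape
    mask = shared_random_mask(b, h, w, device=rgb.device)
    # Same network as fine-tuning; only the inputs are masked here.
    pred = model(rgb * mask, sparse_depth * mask)
    valid = (sparse_depth > 0).float()        # supervise only measured pixels...
    target = valid * (1.0 - mask)             # ...that fall in masked patches
    # Assumed L1 reconstruction loss over the masked, valid pixels.
    loss = ((pred - sparse_depth).abs() * target).sum() / target.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this sketch, fine-tuning would reuse exactly the same network, simply feeding the unmasked RGB and sparse depth and supervising against dense ground truth, which is the point of having no architectural gap between the two stages.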
Acknowledgement
The authors would like to thank the reviewers for their detailed comments and instructive suggestions. This work was supported by the National Science Fund of China under Grant Nos. U1713208 and 62072242, and by the Postdoctoral Innovative Talent Support Program of China under Grants BX20200168 and 2020M681608. The PCA Lab is associated with the Key Lab of Intelligent Perception and Systems for High-Dimensional Information of the Ministry of Education, and the Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yan, Z., Li, X., Wang, K., Zhang, Z., Li, J., Yang, J. (2022). Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_22
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7