Abstract
Self-supervised monocular depth estimation has attracted notable interest because it frees training from dependence on depth annotations. When training on monocular video, recent methods perform view synthesis only between existing camera views, which provides insufficient guidance. To tackle this, we synthesize additional virtual camera views via flow-based video frame interpolation (VFI), which we term temporal augmentation. For multi-frame inference, we sidestep the dynamic-object problem faced by explicit geometry-based methods such as ManyDepth by returning to the feature-fusion paradigm: we design a VFI-assisted multi-frame fusion module that aligns and aggregates multi-frame features using the motion and occlusion information produced by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. Within this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, together with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method brings significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.
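To make the two ideas in the abstract more concrete, below is a minimal PyTorch sketch (not the authors' code) of (1) a photometric reconstruction loss that treats a VFI-synthesized virtual view as an extra supervision source, and (2) a triplet depth consistency regularizer over single-frame, multi-frame, and spatially augmented predictions. The helper names (`photometric_error`, `triplet_depth_consistency`), tensor shapes, SSIM weighting, and exact loss forms are illustrative assumptions; the precise formulation is given in the paper body.

```python
import torch
import torch.nn.functional as F


def photometric_error(pred, target, alpha=0.85):
    """Per-pixel SSIM + L1 photometric error map (a Monodepth2-style loss,
    used here as a stand-in for the paper's reconstruction loss)."""
    l1 = (pred - target).abs().mean(dim=1, keepdim=True)
    # Simplified 3x3 SSIM
    mu_p = F.avg_pool2d(pred, 3, 1, 1)
    mu_t = F.avg_pool2d(target, 3, 1, 1)
    sigma_p = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_p ** 2
    sigma_t = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + c1) * (2 * sigma_pt + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (sigma_p + sigma_t + c2))
    ssim = ((1 - ssim) / 2).clamp(0, 1).mean(dim=1, keepdim=True)
    return alpha * ssim + (1 - alpha) * l1  # shape (B, 1, H, W)


def triplet_depth_consistency(d_a, d_b, d_c, weight=0.1):
    """Toy triplet consistency: penalize pairwise log-depth disagreement
    between three depth predictions of the same view. The exact form used
    in Mono-ViFI may differ; this only illustrates the idea."""
    l_ab = (d_a.log() - d_b.log()).abs().mean()
    l_ac = (d_a.log() - d_c.log()).abs().mean()
    l_bc = (d_b.log() - d_c.log()).abs().mean()
    return weight * (l_ab + l_ac + l_bc)


if __name__ == "__main__":
    b, h, w = 2, 96, 320

    # Stand-ins for: the target frame, a real neighbouring frame warped into
    # the target view, and a *virtual* view synthesized by a flow-based VFI
    # model at an intermediate timestamp (temporal augmentation), also warped.
    target = torch.rand(b, 3, h, w)
    warped_real = torch.rand(b, 3, h, w)
    warped_virtual = torch.rand(b, 3, h, w)

    # Per-pixel minimum over reconstruction sources (min-reprojection), now
    # with the VFI-synthesized virtual view as an additional supervision source.
    errs = torch.cat([photometric_error(warped_real, target),
                      photometric_error(warped_virtual, target)], dim=1)
    loss_photo = errs.min(dim=1, keepdim=True)[0].mean()

    # Placeholder depth maps from the single-frame model, the multi-frame
    # fusion model, and a prediction on an affine-transformed (spatially
    # augmented) input mapped back to the original view; kept positive for log().
    d_single = torch.rand(b, 1, h, w) + 0.1
    d_multi = torch.rand(b, 1, h, w) + 0.1
    d_aug = torch.rand(b, 1, h, w) + 0.1

    total_loss = loss_photo + triplet_depth_consistency(d_single, d_multi, d_aug)
    print(float(total_loss))
```

In an actual training loop, the warped images would come from projecting source views with the predicted depth and pose, and the three depth maps would be produced by the (weight-shared) single- and multi-frame networks rather than random tensors.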
References
Bae, J., Moon, S., Im, S.: Deep digging into the generalization of self-supervised monocular depth estimation. In: AAAI, pp. 187–196 (2023)
Bangunharcana, A., Magd, A., Kim, K.S.: DualRefine: self-supervised depth and pose estimation through iterative epipolar sampling and refinement toward equilibrium. In: CVPR, pp. 726–738 (2023)
Bello, J.L.G., Kim, M.: Forget about the LiDAR: self-supervised depth estimators with MED probability volumes. In: NeurIPS, pp. 12626–12637 (2020)
Bello, J.L.G., Moon, J., Kim, M.: Positional information is all you need: a novel pipeline for self-supervised SVDE from videos. arXiv:2205.08851 (2022)
Bian, J., et al.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: NeurIPS, pp. 35–45 (2019)
Chan, K.C., Wang, X., Yu, K., Dong, C., Loy, C.C.: BasicVSR: the search for essential components in video super-resolution and beyond. In: CVPR, pp. 4945–4954 (2021)
Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: improving video super-resolution with enhanced propagation and alignment. In: CVPR, pp. 5962–5971 (2022)
Chen, Y., Liu, S., Wang, X.: Learning continuous image representation with local implicit image function. In: CVPR, pp. 8624–8634 (2021)
Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In: ICCV, pp. 7063–7072 (2019)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV, pp. 2650–2658 (2015)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS, pp. 2366–2374 (2014)
Feng, Z., Yang, L., Jing, L., Wang, H., Tian, Y., Li, B.: Disentangling object motion and occlusion for unsupervised multi-frame monocular depth. In: ECCV, pp. 228–244 (2022)
Garg, R., Kumar, B.G.V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: ECCV, pp. 740–756 (2016)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. IJRR 32(11), 1231–1237 (2013)
Godard, C., Aodha, O.M., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, pp. 6602–6611 (2017)
Godard, C., Aodha, O.M., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: CVPR, pp. 3828–3838 (2019)
Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In: ICCV, pp. 8976–8985 (2019)
Guizilini, V., Ambrus, R., Chen, D., Zakharov, S., Gaidon, A.: Multi-frame self-supervised depth with transformers. In: CVPR, pp. 160–170 (2022)
Guizilini, V., Ambrus, R., Pillai, S., Gaidon, A.: 3D packing for self-supervised monocular depth estimation. In: CVPR, pp. 2482–2491 (2020)
Han, W., Yin, J., Jin, X., Dai, X., Shen, J.: BRNet: exploring comprehensive features for monocular depth estimation. In: ECCV, pp. 586–602 (2022)
Han, W., Yin, J., Shen, J.: Self-supervised monocular depth estimation by direction-aware cumulative convolution network. In: ICCV (2023)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J.: RA-Depth: resolution adaptive self-supervised monocular depth estimation. In: ECCV, pp. 565–581 (2022)
Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation. In: ECCV, pp. 624–642 (2022)
Jung, H., Park, E., Yoo, S.: Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. In: ICCV, pp. 12622–12632 (2021)
Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance. In: ECCV, pp. 582–600 (2020)
Kong, L., Jiang, B., Luo, D., Chu, W., Huang, X., Tai, Y., Wang, C., Yang, J.: IFRNet: intermediate feature refine network for efficient frame interpolation. In: CVPR, pp. 1959–1968 (2022)
Kuznietsov, Y., Proesmans, M., Gool, L.V.: CoMoDA: continuous monocular depth adaptation using past experiences. In: WACV, pp. 2906–2916 (2021)
Lee, S., Im, S., Lin, S., Kweon, I.S.: Learning monocular depth in dynamic scenes via instance-aware projection consistency. In: AAAI, pp. 1863–1872 (2021)
Lee, S., Rameau, F., Pan, F., Kweon, I.S.: Attentive and contrastive learning for joint depth and motion field estimation. In: ICCV, pp. 4862–4871 (2021)
Li, H., Gordon, A., Zhao, H., Casser, V., Angelova, A.: Unsupervised monocular depth learning in dynamic scenes. In: CoRL, pp. 1908–1917 (2020)
Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: AMT: all-pairs multi-field transforms for efficient frame interpolation. In: CVPR, pp. 9801–9810 (2023)
Liu, J., Kong, L., Yang, J.: Designing and searching for lightweight monocular depth network. In: ICONIP, pp. 477–488 (2021)
Liu, J., Kong, L., Yang, J.: ATCA: an arc trajectory based model with curvature attention for video frame interpolation. In: ICIP, pp. 1486–1490 (2022)
Liu, J., Kong, L., Yang, J., Liu, W.: Towards better data exploitation in self-supervised monocular depth estimation. IEEE Robot. Autom. Lett. 9(1), 763–770 (2024)
Luo, X., Huang, J.B., Szeliski, R., Matzen, K., Kopf, J.: Consistent video depth estimation. TOG 39(71), 1–13 (2020)
Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., Yuan, Y.: HR-depth: high resolution self-supervised monocular depth estimation. In: AAAI, pp. 2294–2301 (2021)
Ma, J., Lei, X., Liu, N., Zhao, X., Pu, S.: Towards comprehensive representation enhancement in semantics-guided self-supervised monocular depth estimation. In: ECCV, pp. 304–321 (2022)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 405–421 (2020)
Patil, V., Gansbeke, W.V., Dai, D., Gool, L.V.: Don’t forget the past: recurrent depth estimation from monocular video. IEEE Robot. Autom. Lett. 5(4), 6813–6820 (2020)
Peng, R., Wang, R., Lai, Y., Tang, L., Cai, Y.: Excavating the potential capacity of self-supervised monocular depth estimation. In: ICCV, pp. 15560–15569 (2021)
Petrovai, A., Nedevschi, S.: Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In: CVPR, pp. 1568–1578 (2022)
Pilzer, A., Lathuilière, S., Sebe, N., Ricci, E.: Refine and distill: exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In: CVPR, pp. 9760–9769 (2019)
Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the uncertainty of self-supervised monocular depth estimation. In: CVPR, pp. 3224–3234 (2020)
Ranjan, A., et al.: Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: CVPR, pp. 12240–12249 (2019)
Ren, W., Wang, L., Piao, Y., Zhang, M., Lu, H., Liu, T.: Adaptive co-teaching for unsupervised monocular depth estimation. In: ECCV, pp. 89–105 (2022)
Ruhkamp, P., Gao, D., Chen, H., Navab, N., Busam, B.: Attention meets geometry: Geometry guided spatial-temporal attention for consistent self-supervised monocular depth estimation. In: 3DV, pp. 837–847 (2021)
Saxena, A., Sun, M., Ng, A.Y.: Make3D: learning 3D scene structure from a single still image. PAMI 31(5), 824–840 (2009)
Tancik, M., et al.: Fourier features let networks learn high frequency functions in low dimensional domains. In: NeurIPS, pp. 7537–7547 (2020)
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: 3DV, pp. 11–20 (2017)
Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: CVPR, pp. 2022–2030 (2018)
Wang, R., Pizer, S.M., Frahm, J.M.: Recurrent neural network for (un-)supervised learning of monocular video visual odometry and depth. In: CVPR, pp. 5555–5564 (2019)
Wang, R., Yu, Z., Gao, S.: PlaneDepth: self-supervised depth estimation via orthogonal planes. In: CVPR, pp. 21425–21434 (2023)
Wang, X., et al.: Crafting monocular cues and velocity guidance for self-supervised multi-frame depth learning. In: AAAI, pp. 2689–2697 (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
Watson, J., Aodha, O.M., Prisacariu, V.A., Brostow, G.J., Firman, M.: The temporal opportunist: Self-supervised multi-frame monocular depth. In: CVPR, pp. 1164–1174 (2021)
Watson, J., Firman, M., Brostow, G.J., Turmukhambetov, D.: Self-supervised monocular depth hints. In: ICCV, pp. 2162–2171 (2019)
Yin, Z., Shi, J.: GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: CVPR, pp. 1983–1992 (2018)
Zhang, H., Li, Y., Cao, Y., Liu, Y., Shen, C., Yan, Y.: Exploiting temporal consistency for real-time video depth estimation. In: ICCV, pp. 1725–1734 (2019)
Zhang, N., Nex, F., Vosselman, G., Kerle, N.: Lite-Mono: a lightweight CNN and transformer architecture for self-supervised monocular depth estimation. In: CVPR, pp. 18537–18546 (2023)
Zhao, C., et al.: GasMono: geometry-aided self-supervised monocular depth estimation for indoor scenes. In: ICCV, pp. 16163–16174 (2023)
Zhao, C., et al.: MonoViT: self-supervised monocular depth estimation with a vision transformer. In: 3DV, pp. 668–678 (2022)
Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. In: BMVC (2021)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR, pp. 6612–6619 (2017)
Zhou, Z., Fan, X., Shi, P., Xin, Y.: R-MSFM: recurrent multi-scale feature modulation for monocular depth estimating. In: ICCV, pp. 12757–12766 (2021)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, J., Kong, L., Li, B., Wang, Z., Gu, H., Chen, J. (2025). Mono-ViFI: A Unified Learning Framework for Self-supervised Single and Multi-frame Monocular Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15103. Springer, Cham. https://doi.org/10.1007/978-3-031-72995-9_6
DOI: https://doi.org/10.1007/978-3-031-72995-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72994-2
Online ISBN: 978-3-031-72995-9
eBook Packages: Computer Science, Computer Science (R0)