
Mono-ViFI: A Unified Learning Framework for Self-supervised Single and Multi-frame Monocular Depth Estimation

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15103)

Abstract

Self-supervised monocular depth estimation has attracted considerable interest because it frees training from dependence on depth annotations. When training on monocular videos, recent methods perform view synthesis only between existing camera views, which provides insufficient guidance. To address this, we synthesize additional virtual camera views via flow-based video frame interpolation (VFI), which we term temporal augmentation. For multi-frame inference, to sidestep the dynamic-object problem encountered by explicit geometry-based methods such as ManyDepth, we return to the feature-fusion paradigm and design a VFI-assisted multi-frame fusion module that aligns and aggregates multi-frame features using the motion and occlusion information produced by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformations is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method brings significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.
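The "view synthesis between existing camera views" that the abstract builds on is the standard reprojection objective of self-supervised monocular depth training: warp a neighboring frame into the target view using the predicted depth and relative pose, then penalize the photometric discrepancy. Below is a minimal PyTorch sketch of that baseline objective, assuming the common SSIM + L1 error with a 0.85 weighting as popularized by Monodepth2; the helper names (`backproject`, `warp_to_target`, `photometric_loss`) and constants are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using the predicted depth map.

    depth: (B, 1, H, W); K_inv: (B, 3, 3) inverse intrinsics.
    Returns camera-space points of shape (B, 3, H*W).
    """
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    return (K_inv @ pix) * depth.reshape(B, 1, -1)


def warp_to_target(src_img, depth_tgt, T_tgt2src, K):
    """Synthesize the target view by sampling the source frame.

    T_tgt2src: (B, 4, 4) relative pose mapping target-camera points into
    the source camera. K: (B, 3, 3) intrinsics.
    """
    B, _, H, W = src_img.shape
    cam = backproject(depth_tgt, torch.inverse(K))
    cam = T_tgt2src[:, :3, :3] @ cam + T_tgt2src[:, :3, 3:]  # rigid motion
    pix = K @ cam
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)            # perspective divide
    grid_x = pix[:, 0] / (W - 1) * 2 - 1                     # normalize to [-1, 1]
    grid_y = pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)


def photometric_loss(pred, target, alpha=0.85):
    """Weighted SSIM + L1 reprojection error, per pixel: (B, 1, H, W)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    mu_p, mu_t = F.avg_pool2d(pred, 3, 1, 1), F.avg_pool2d(target, 3, 1, 1)
    var_p = F.avg_pool2d(pred * pred, 3, 1, 1) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, 3, 1, 1) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )
    ssim_err = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)
    return alpha * ssim_err + (1 - alpha) * l1
```

Mono-ViFI's temporal augmentation extends this recipe by having a flow-based VFI model synthesize extra virtual views to supervise against, rather than relying only on the existing neighboring frames.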



Author information

Corresponding author

Correspondence to Lingtong Kong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3146 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, J., Kong, L., Li, B., Wang, Z., Gu, H., Chen, J. (2025). Mono-ViFI: A Unified Learning Framework for Self-supervised Single and Multi-frame Monocular Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15103. Springer, Cham. https://doi.org/10.1007/978-3-031-72995-9_6


  • DOI: https://doi.org/10.1007/978-3-031-72995-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72994-2

  • Online ISBN: 978-3-031-72995-9

  • eBook Packages: Computer Science, Computer Science (R0)
