Abstract
Self-supervised monocular depth estimation has attracted notable interest because it frees training from dependence on depth annotations. When training on monocular video, recent methods perform view synthesis only between existing camera views, which provides insufficient guidance. To tackle this, we synthesize additional virtual camera views via flow-based video frame interpolation (VFI), which we term temporal augmentation. For multi-frame inference, we sidestep the dynamic-object problem faced by explicit geometry-based methods such as ManyDepth by returning to the feature-fusion paradigm: we design a VFI-assisted multi-frame fusion module that aligns and aggregates multi-frame features using the motion and occlusion information produced by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. Within this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, together with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method brings significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.
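To make the two ideas in the abstract more concrete, below is a minimal PyTorch sketch (not the authors' code) of (1) a photometric reconstruction loss that treats a VFI-synthesized virtual view as an extra supervision source, and (2) a triplet depth consistency regularizer over single-frame, multi-frame, and spatially augmented predictions. The helper names (`photometric_error`, `triplet_depth_consistency`), tensor shapes, SSIM weighting, and exact loss forms are illustrative assumptions; the precise formulation is given in the paper body.

```python
import torch
import torch.nn.functional as F


def photometric_error(pred, target, alpha=0.85):
    """Per-pixel SSIM + L1 photometric error map (a Monodepth2-style loss,
    used here as a stand-in for the paper's reconstruction loss)."""
    l1 = (pred - target).abs().mean(dim=1, keepdim=True)
    # Simplified 3x3 SSIM
    mu_p = F.avg_pool2d(pred, 3, 1, 1)
    mu_t = F.avg_pool2d(target, 3, 1, 1)
    sigma_p = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_p ** 2
    sigma_t = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + c1) * (2 * sigma_pt + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (sigma_p + sigma_t + c2))
    ssim = ((1 - ssim) / 2).clamp(0, 1).mean(dim=1, keepdim=True)
    return alpha * ssim + (1 - alpha) * l1  # shape (B, 1, H, W)


def triplet_depth_consistency(d_a, d_b, d_c, weight=0.1):
    """Toy triplet consistency: penalize pairwise log-depth disagreement
    between three depth predictions of the same view. The exact form used
    in Mono-ViFI may differ; this only illustrates the idea."""
    l_ab = (d_a.log() - d_b.log()).abs().mean()
    l_ac = (d_a.log() - d_c.log()).abs().mean()
    l_bc = (d_b.log() - d_c.log()).abs().mean()
    return weight * (l_ab + l_ac + l_bc)


if __name__ == "__main__":
    b, h, w = 2, 96, 320

    # Stand-ins for: the target frame, a real neighbouring frame warped into
    # the target view, and a *virtual* view synthesized by a flow-based VFI
    # model at an intermediate timestamp (temporal augmentation), also warped.
    target = torch.rand(b, 3, h, w)
    warped_real = torch.rand(b, 3, h, w)
    warped_virtual = torch.rand(b, 3, h, w)

    # Per-pixel minimum over reconstruction sources (min-reprojection), now
    # with the VFI-synthesized virtual view as an additional supervision source.
    errs = torch.cat([photometric_error(warped_real, target),
                      photometric_error(warped_virtual, target)], dim=1)
    loss_photo = errs.min(dim=1, keepdim=True)[0].mean()

    # Placeholder depth maps from the single-frame model, the multi-frame
    # fusion model, and a prediction on an affine-transformed (spatially
    # augmented) input mapped back to the original view; kept positive for log().
    d_single = torch.rand(b, 1, h, w) + 0.1
    d_multi = torch.rand(b, 1, h, w) + 0.1
    d_aug = torch.rand(b, 1, h, w) + 0.1

    total_loss = loss_photo + triplet_depth_consistency(d_single, d_multi, d_aug)
    print(float(total_loss))
```

In an actual training loop, the warped images would come from projecting source views with the predicted depth and pose, and the three depth maps would be produced by the (weight-shared) single- and multi-frame networks rather than random tensors.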
References
Bae, J., Moon, S., Im, S.: Deep digging into the generalization of self-supervised monocular depth estimation. In: AAAI, pp. 187–196 (2023)
Bangunharcana, A., Magd, A., Kim, K.S.: DualRefine: self-supervised depth and pose estimation through iterative epipolar sampling and refinement toward equilibrium. In: CVPR, pp. 726–738 (2023)
Bello, J.L.G., Kim, M.: Forget about the LiDAR: self-supervised depth estimators with MED probability volumes. In: NeurIPS, pp. 12626–12637 (2020)
Bello, J.L.G., Moon, J., Kim, M.: Positional information is all you need: a novel pipeline for self-supervised SVDE from videos. arXiv:2205.08851 (2022)
Bian, J., et al.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: NeurIPS, pp. 35–45 (2019)
Chan, K.C., Wang, X., Yu, K., Dong, C., Loy, C.C.: BasicVSR: the search for essential components in video super-resolution and beyond. In: CVPR, pp. 4945–4954 (2021)
Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: improving video super-resolution with enhanced propagation and alignment. In: CVPR, pp. 5962–5971 (2022)
Chen, Y., Liu, S., Wang, X.: Learning continuous image representation with local implicit image function. In: CVPR, pp. 8624–8634 (2021)
Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In: ICCV, pp. 7063–7072 (2019)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV, pp. 2650–2658 (2015)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS, pp. 2366–2374 (2014)
Feng, Z., Yang, L., Jing, L., Wang, H., Tian, Y., Li, B.: Disentangling object motion and occlusion for unsupervised multi-frame monocular depth. In: ECCV, pp. 228–244 (2022)
Garg, R., Kumar, B.G.V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: ECCV, pp. 740–756 (2016)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. IJRR 32(11), 1231–1237 (2013)
Godard, C., Aodha, O.M., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, pp. 6602–6611 (2017)
Godard, C., Aodha, O.M., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: CVPR, pp. 3828–3838 (2019)
Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In: ICCV, pp. 8976–8985 (2019)
Guizilini, V., Ambrus, R., Chen, D., Zakharov, S., Gaidon, A.: Multi-frame self-supervised depth with transformers. In: CVPR, pp. 160–170 (2022)
Guizilini, V., Ambrus, R., Pillai, S., Gaidon, A.: 3D packing for self-supervised monocular depth estimation. In: CVPR, pp. 2482–2491 (2020)
Han, W., Yin, J., Jin, X., Dai, X., Shen, J.: BRNet: exploring comprehensive features for monocular depth estimation. In: ECCV, pp. 586–602 (2022)
Han, W., Yin, J., Shen, J.: Self-supervised monocular depth estimation by direction-aware cumulative convolution network. In: ICCV (2023)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J.: RA-Depth: resolution adaptive self-supervised monocular depth estimation. In: ECCV, pp. 565–581 (2022)
Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation. In: ECCV, pp. 624–642 (2022)
Jung, H., Park, E., Yoo, S.: Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. In: ICCV, pp. 12622–12632 (2021)
Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance. In: ECCV, pp. 582–600 (2020)
Kong, L., Jiang, B., Luo, D., Chu, W., Huang, X., Tai, Y., Wang, C., Yang, J.: IFRNet: intermediate feature refine network for efficient frame interpolation. In: CVPR, pp. 1959–1968 (2022)
Kuznietsov, Y., Proesmans, M., Gool, L.V.: CoMoDA: continuous monocular depth adaptation using past experiences. In: WACV, pp. 2906–2916 (2021)
Lee, S., Im, S., Lin, S., Kweon, I.S.: Learning monocular depth in dynamic scenes via instance-aware projection consistency. In: AAAI, pp. 1863–1872 (2021)
Lee, S., Rameau, F., Pan, F., Kweon, I.S.: Attentive and contrastive learning for joint depth and motion field estimation. In: ICCV, pp. 4862–4871 (2021)
Li, H., Gordon, A., Zhao, H., Casser, V., Angelova, A.: Unsupervised monocular depth learning in dynamic scenes. In: CoRL, pp. 1908–1917 (2020)
Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: AMT: all-pairs multi-field transforms for efficient frame interpolation. In: CVPR, pp. 9801–9810 (2023)
Liu, J., Kong, L., Yang, J.: Designing and searching for lightweight monocular depth network. In: ICONIP, pp. 477–488 (2021)
Liu, J., Kong, L., Yang, J.: ATCA: an arc trajectory based model with curvature attention for video frame interpolation. In: ICIP, pp. 1486–1490 (2022)
Liu, J., Kong, L., Yang, J., Liu, W.: Towards better data exploitation in self-supervised monocular depth estimation. IEEE Robot. Autom. Lett. 9(1), 763–770 (2024)
Luo, X., Huang, J.B., Szeliski, R., Matzen, K., Kopf, J.: Consistent video depth estimation. TOG 39(71), 1–13 (2020)
Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., Yuan, Y.: HR-depth: high resolution self-supervised monocular depth estimation. In: AAAI, pp. 2294–2301 (2021)
Ma, J., Lei, X., Liu, N., Zhao, X., Pu, S.: Towards comprehensive representation enhancement in semantics-guided self-supervised monocular depth estimation. In: ECCV, pp. 304–321 (2022)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 405–421 (2020)
Patil, V., Gansbeke, W.V., Dai, D., Gool, L.V.: Don’t forget the past: recurrent depth estimation from monocular video. IEEE Robot. Autom. Lett. 5(4), 6813–6820 (2020)
Peng, R., Wang, R., Lai, Y., Tang, L., Cai, Y.: Excavating the potential capacity of self-supervised monocular depth estimation. In: ICCV, pp. 15560–15569 (2021)
Petrovai, A., Nedevschi, S.: Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In: CVPR, pp. 1568–1578 (2022)
Pilzer, A., Lathuilière, S., Sebe, N., Ricci, E.: Refine and distill: exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In: CVPR, pp. 9760–9769 (2019)
Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the uncertainty of self-supervised monocular depth estimation. In: CVPR, pp. 3224–3234 (2020)
Ranjan, A., et al.: Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: CVPR, pp. 12240–12249 (2019)
Ren, W., Wang, L., Piao, Y., Zhang, M., Lu, H., Liu, T.: Adaptive co-teaching for unsupervised monocular depth estimation. In: ECCV, pp. 89–105 (2022)
Ruhkamp, P., Gao, D., Chen, H., Navab, N., Busam, B.: Attention meets geometry: Geometry guided spatial-temporal attention for consistent self-supervised monocular depth estimation. In: 3DV, pp. 837–847 (2021)
Saxena, A., Sun, M., Ng, A.Y.: Make3D: learning 3D scene structure from a single still image. PAMI 31(5), 824–840 (2009)
Tancik, M., et al.: Fourier features let networks learn high frequency functions in low dimensional domains. In: NeurIPS, pp. 7537–7547 (2020)
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: 3DV, pp. 11–20 (2017)
Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: CVPR, pp. 2022–2030 (2018)
Wang, R., Pizer, S.M., Frahm, J.M.: Recurrent neural network for (un-)supervised learning of monocular video visual odometry and depth. In: CVPR, pp. 5555–5564 (2019)
Wang, R., Yu, Z., Gao, S.: PlaneDepth: self-supervised depth estimation via orthogonal planes. In: CVPR, pp. 21425–21434 (2023)
Wang, X., et al.: Crafting monocular cues and velocity guidance for self-supervised multi-frame depth learning. In: AAAI, pp. 2689–2697 (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
Watson, J., Aodha, O.M., Prisacariu, V.A., Brostow, G.J., Firman, M.: The temporal opportunist: Self-supervised multi-frame monocular depth. In: CVPR, pp. 1164–1174 (2021)
Watson, J., Firman, M., Brostow, G.J., Turmukhambetov, D.: Self-supervised monocular depth hints. In: ICCV, pp. 2162–2171 (2019)
Yin, Z., Shi, J.: GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: CVPR, pp. 1983–1992 (2018)
Zhang, H., Li, Y., Cao, Y., Liu, Y., Shen, C., Yan, Y.: Exploiting temporal consistency for real-time video depth estimation. In: ICCV, pp. 1725–1734 (2019)
Zhang, N., Nex, F., Vosselman, G., Kerle, N.: Lite-Mono: a lightweight CNN and transformer architecture for self-supervised monocular depth estimation. In: CVPR, pp. 18537–18546 (2023)
Zhao, C., et al.: GasMono: geometry-aided self-supervised monocular depth estimation for indoor scenes. In: ICCV, pp. 16163–16174 (2023)
Zhao, C., et al.: MonoViT: self-supervised monocular depth estimation with a vision transformer. In: 3DV, pp. 668–678 (2022)
Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. In: BMVC (2021)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR, pp. 6612–6619 (2017)
Zhou, Z., Fan, X., Shi, P., Xin, Y.: R-MSFM: recurrent multi-scale feature modulation for monocular depth estimating. In: ICCV, pp. 12757–12766 (2021)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, J., Kong, L., Li, B., Wang, Z., Gu, H., Chen, J. (2025). Mono-ViFI: A Unified Learning Framework for Self-supervised Single and Multi-frame Monocular Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15103. Springer, Cham. https://doi.org/10.1007/978-3-031-72995-9_6
DOI: https://doi.org/10.1007/978-3-031-72995-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72994-2
Online ISBN: 978-3-031-72995-9
eBook Packages: Computer Science, Computer Science (R0)