PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

  • Conference paper
  • First Online:

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15125)

Included in the following conference series: European Conference on Computer Vision

Abstract

This paper introduces PatchRefiner, an advanced framework for metric single-image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner’s superior performance, significantly outperforming existing benchmarks on the Unreal4KStereo dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like Cityscapes, ScanNet++, and ETH3D.
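The abstract frames high-resolution metric depth estimation as refinement of a coarse, scale-consistent base prediction over image tiles, and reports results in RMSE. The sketch below is a minimal NumPy illustration of only those two ideas, not the authors' implementation: the refine_tile callable, the tile and stride sizes, and the simple averaging blend are hypothetical placeholders.

```python
import numpy as np


def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root mean squared error over valid (positive-depth) ground-truth pixels."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))


def refine_by_tiles(image, coarse_depth, refine_tile, tile=384, stride=192):
    """Run a (hypothetical) tile-level refiner over overlapping crops and blend
    the refined tiles back into the coarse, scale-consistent base depth map."""
    h, w = coarse_depth.shape
    out = np.zeros((h, w), dtype=np.float64)
    hits = np.zeros((h, w), dtype=np.float64)
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            crop_rgb = image[y:y + tile, x:x + tile]
            crop_depth = coarse_depth[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] += refine_tile(crop_rgb, crop_depth)
            hits[y:y + tile, x:x + tile] += 1.0
    covered = hits > 0
    out[covered] /= hits[covered]            # average overlapping tile predictions
    out[~covered] = coarse_depth[~covered]   # keep the coarse estimate where no tile landed
    return out


# Toy usage: random "scene", identity refiner standing in for a learned tile network.
rng = np.random.default_rng(0)
rgb = rng.random((512, 768, 3))
coarse = 1.0 + 9.0 * rng.random((512, 768))          # coarse metric depth, 1-10 m
gt = coarse + rng.normal(0.0, 0.05, coarse.shape)    # synthetic "ground truth"
refined = refine_by_tiles(rgb, coarse, lambda im, d: d)
print(f"RMSE after tile blending: {rmse(refined, gt):.3f} m")
```

In the paper's setting, the role of refine_tile would be played by the learned refinement network trained with the pseudo-labeling strategy and the DSD loss described above; here an identity function stands in so the sketch runs end to end.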



Acknowledgements

This publication is funded in part by KAUST under award #ORA-2023-5241 (for the NTGI-AI project).

Author information

Corresponding author

Correspondence to Zhenyu Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3727 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Z., Bhat, S.F., Wonka, P. (2025). PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72855-6_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72854-9

  • Online ISBN: 978-3-031-72855-6

  • eBook Packages: Computer Science, Computer Science (R0)
