PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

  • Conference paper
  • First Online:

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15125)

Included in the following conference series: European Conference on Computer Vision

Abstract

This paper introduces PatchRefiner, an advanced framework for metric single-image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner’s superior performance, significantly outperforming existing benchmarks on the Unreal4KStereo dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like Cityscapes, ScanNet++, and ETH3D.
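The abstract frames high-resolution metric depth estimation as refinement of a coarse, scale-consistent base prediction over image tiles, and reports results in RMSE. The sketch below is a minimal NumPy illustration of only those two ideas, not the authors' implementation: the refine_tile callable, the tile and stride sizes, and the simple averaging blend are hypothetical placeholders.

```python
import numpy as np


def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root mean squared error over valid (positive-depth) ground-truth pixels."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))


def refine_by_tiles(image, coarse_depth, refine_tile, tile=384, stride=192):
    """Run a (hypothetical) tile-level refiner over overlapping crops and blend
    the refined tiles back into the coarse, scale-consistent base depth map."""
    h, w = coarse_depth.shape
    out = np.zeros((h, w), dtype=np.float64)
    hits = np.zeros((h, w), dtype=np.float64)
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            crop_rgb = image[y:y + tile, x:x + tile]
            crop_depth = coarse_depth[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] += refine_tile(crop_rgb, crop_depth)
            hits[y:y + tile, x:x + tile] += 1.0
    covered = hits > 0
    out[covered] /= hits[covered]            # average overlapping tile predictions
    out[~covered] = coarse_depth[~covered]   # keep the coarse estimate where no tile landed
    return out


# Toy usage: random "scene", identity refiner standing in for a learned tile network.
rng = np.random.default_rng(0)
rgb = rng.random((512, 768, 3))
coarse = 1.0 + 9.0 * rng.random((512, 768))          # coarse metric depth, 1-10 m
gt = coarse + rng.normal(0.0, 0.05, coarse.shape)    # synthetic "ground truth"
refined = refine_by_tiles(rgb, coarse, lambda im, d: d)
print(f"RMSE after tile blending: {rmse(refined, gt):.3f} m")
```

In the paper's setting, the role of refine_tile would be played by the learned refinement network trained with the pseudo-labeling strategy and the DSD loss described above; here an identity function stands in so the sketch runs end to end.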



Acknowledgements

This publication is funded in part by KAUST under award #ORA-2023-5241 (for the NTGI-AI project).

Author information

Corresponding author

Correspondence to Zhenyu Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3727 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Z., Bhat, S.F., Wonka, P. (2025). PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72855-6_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72854-9

  • Online ISBN: 978-3-031-72855-6

  • eBook Packages: Computer Science, Computer Science (R0)
