Abstract
This work addresses the task of zero-shot monocular depth estimation. A recent advance in this field has been the idea of utilising text-to-image foundation models, such as Stable Diffusion [51]. Foundation models provide a rich and generic image representation, and therefore little training data is required to reformulate them as a depth estimation model that predicts highly detailed depth maps and generalises well. However, realisations of this idea have so far led to approaches which are, unfortunately, highly inefficient at test time due to the underlying iterative denoising process. In this work, we propose a different realisation of this idea and present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. Our key idea is to extract from Stable Diffusion a rich, but frozen, image representation by running a single denoising step. This representation, which we term the preimage, is then fed into a refiner network with an architectural inductive bias before entering the downstream task. We validate experimentally that PrimeDepth is two orders of magnitude faster than the leading diffusion-based method, Marigold [28], while being more robust in challenging scenarios and marginally superior quantitatively. We thereby reduce the gap to the currently leading data-driven approach, Depth Anything [72], which remains quantitatively superior but predicts less detailed depth maps and requires 20 times more labelled data. Owing to the complementary nature of our approach, even a simple averaging of PrimeDepth and Depth Anything predictions improves upon both methods and sets a new state of the art in zero-shot monocular depth estimation. In the future, data-driven approaches may also benefit from integrating our preimage.
D. Zavadski and D. Kalšan—Equal Contribution.
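To make the data flow described in the abstract concrete, the following is a minimal conceptual sketch, not the authors' implementation: the frozen Stable Diffusion UNet is replaced by a tiny stand-in module, and the layer sizes, the refiner design, and all names are hypothetical placeholders chosen only to illustrate the pipeline (image latent → single denoising step through a frozen backbone → multi-scale activations as the "preimage" → trainable refiner → depth map).

```python
# Conceptual sketch of a PrimeDepth-style pipeline (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFrozenUNet(nn.Module):
    """Stand-in for the frozen Stable Diffusion UNet (hypothetical)."""
    def __init__(self, ch=(64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip((4,) + ch[:-1], ch)
        ])
        for p in self.parameters():          # the backbone stays frozen
            p.requires_grad_(False)

    def forward(self, z_t, t):
        feats, x = [], z_t
        for blk in self.blocks:
            x = torch.relu(blk(x))
            feats.append(x)                  # collect multi-scale activations
        return feats                         # the "preimage"


class Refiner(nn.Module):
    """Trainable refiner that fuses the preimage into a single depth map."""
    def __init__(self, ch=(64, 128, 256)):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in ch])
        self.head = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, feats, out_size):
        fused = 0
        for p, f in zip(self.proj, feats):   # project, upsample, and sum
            fused = fused + F.interpolate(
                p(f), size=out_size, mode="bilinear", align_corners=False)
        return self.head(fused)              # relative depth map


def predict_depth(latent, unet, refiner, t=999):
    """Single denoising step -> preimage -> refined depth (no iterative sampling)."""
    with torch.no_grad():
        preimage = unet(latent, t)
    return refiner(preimage, out_size=latent.shape[-2:])


if __name__ == "__main__":
    unet, refiner = TinyFrozenUNet(), Refiner()
    latent = torch.randn(1, 4, 64, 64)       # stands in for the VAE latent of an image
    depth = predict_depth(latent, unet, refiner)
    print(depth.shape)                        # torch.Size([1, 1, 64, 64])
```

Only the refiner carries trainable parameters in this sketch, which mirrors the abstract's point that the foundation model is kept frozen and a single denoising step suffices at test time.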
Notes
1. See Robust Vision Challenge 2020 (http://www.robustvision.net/rvc2020.php).
2. We take the average with respect to both measures, \(\delta _1\) and AbsRel (a reference sketch of how these metrics are typically computed follows these notes).
3. Computed from 1000 runs of different image resolutions on an A100 GPU.
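For reference, the snippet below is our own minimal sketch of the two standard depth metrics mentioned in Note 2, not the authors' evaluation code; how the paper combines the two measures beyond computing each of them is not specified here. \(\delta _1\) is an accuracy (higher is better), AbsRel an error (lower is better).

```python
# Standard zero-shot depth metrics: AbsRel and delta_1 (illustrative sketch).
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta_1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25; assumes pred > 0."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))

if __name__ == "__main__":
    gt = np.random.uniform(0.5, 10.0, size=(480, 640))       # toy ground truth
    pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)   # toy prediction
    print(f"AbsRel: {abs_rel(pred, gt):.3f}, delta_1: {delta_1(pred, gt):.3f}")
```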
References
Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 4009–4018 (2021)
Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11621–11631 (2020)
Chen, P.Y., Liu, A.H., Liu, Y.C., Wang, Y.C.F.: Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2624–2632 (2019)
Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. Adv. Neural Inform. Process. Syst. 29 (2016)
Cheng, J., Yin, W., Wang, K., Chen, X., Wang, S., Yang, X.: Adaptive fusion of single-view and multi-view depth for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10138–10147 (2024)
Choi, J., Choi, Y., Kim, Y., Kim, J., Yoon, S.: Custom-edit: Text-guided image editing with customized diffusion models. arXiv preprint arXiv:2305.15779 (2023)
Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)
Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Adv. Neural Inform. Process. Syst. 34, 8780–8794 (2021)
Dockhorn, T., Vahdat, A., Kreis, K.: Genie: Higher-order denoising diffusion solvers. Adv. Neural Inform. Process. Syst. 35, 30150–30166 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Duan, Y., Guo, X., Zhu, Z.: Diffusiondepth: Diffusion denoising approach for monocular depth estimation. arXiv preprint arXiv:2303.05021 (2023)
Eftekhar, A., Sax, A., Malik, J., Zamir, A.: Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In: Int. Conf. Comput. Vis. pp. 10786–10796 (2021)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inform. Process. Syst. 27 (2014)
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2002–2011 (2018)
Garg, R., Bg, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: Eur. Conf. Comput. Vis. pp. 740–756. Springer (2016)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3354–3361. IEEE (2012)
Goel, V., Peruzzo, E., Jiang, Y., Xu, D., Sebe, N., Darrell, T., Wang, Z., Shi, H.: Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546 (2023)
Guizilini, V., Hou, R., Li, J., Ambrus, R., Gaidon, A.: Semantically-guided representation learning for self-supervised monocular depth. arXiv preprint arXiv:2002.12319 (2020)
Hedlin, E., Sharma, G., Mahajan, S., Isack, H., Kar, A., Tagliasacchi, A., Yi, K.M.: Unsupervised semantic correspondence using stable diffusion. Adv. Neural Inform. Process. Syst. 36 (2024)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural Inform. Process. Syst. 33, 6840–6851 (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23(1), 2249–2281 (2022)
Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. In: ACM SIGGRAPH, pp. 577–584 (2005)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: Int. Conf. Learn. Represent. (2022)
Hu, H., Chan, K.C., Su, Y.C., Chen, W., Li, Y., Sohn, K., Zhao, Y., Ben, X., Gong, B., Cohen, W., et al.: Instruct-imagen: Image generation with multi-modal instruction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 4754–4763 (2024)
Hu, V.T., Zhang, D.W., Asano, Y.M., Burghouts, G.J., Snoek, C.G.: Self-guided diffusion models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 18413–18422 (2023)
Ji, Y., Chen, Z., Xie, E., Hong, L., Liu, X., Liu, Z., Lu, T., Li, Z., Luo, P.: Ddp: Diffusion model for dense visual prediction. In: Int. Conf. Comput. Vis. pp. 21741–21752 (October 2023)
Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9492–9502 (June 2024)
Kirillov, A., Girshick, R., He, K., Dollar, P.: Panoptic feature pyramid networks. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2019)
Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In: Eur. Conf. Comput. Vis. pp. 582–600. Springer (2020)
Koh, J.Y., Park, S.H., Song, J.: Improving text generation on images with synthetic captions. arXiv preprint arXiv:2406.00505 (2024)
Kondapaneni, N., Marks, M., Knott, M., Guimaraes, R., Perona, P.: Text-image alignment for diffusion-based perception. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 13883–13893 (June 2024)
Lavreniuk, M., Bhat, S.F., Müller, M., Wonka, P.: Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment. arXiv preprint arXiv:2312.08548 (2023)
Lee, H.Y., Tseng, H.Y., Lee, H.Y., Yang, M.H.: Exploiting diffusion prior for generalizable dense prediction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7861–7871 (June 2024)
Lee, J.H., Han, M.K., Ko, D.W., Suh, I.H.: From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326 (2019)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Int. Conf. Mach. Learn. Proceedings of Machine Learning Research, vol. 202, pp. 19730–19742. PMLR (23–29 Jul 2023)
Lin, S., Wang, A., Yang, X.: Sdxl-lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929 (2024)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Int. Conf. Comput. Vis. pp. 2980–2988 (2017)
Liu, B., Wang, C., Cao, T., Jia, K., Huang, J.: Towards understanding cross and self-attention in stable diffusion for text-guided image editing. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7817–7826 (2024)
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B.: Swin transformer v2: Scaling up capacity and resolution. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 12009–12019 (June 2022)
Long, X., Lin, C., Liu, L., Li, W., Theobalt, C., Yang, R., Wang, W.: Adaptive surface normal constraint for depth estimation. In: Int. Conf. Comput. Vis. pp. 12849–12858 (2021)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inform. Process. Syst. 35, 5775–5787 (2022)
Luo, G., Dunlap, L., Park, D.H., Holynski, A., Darrell, T.: Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Adv. Neural Inform. Process. Syst. 36 (2024)
Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Int. Conf. on 3D Imaging, Modeling, Processing, Visualization and Transmission. pp. 565–571. IEEE (2016)
Palmer, S.E.: Vision science: Photons to phenomenology. MIT press (1999)
Patni, S., Agarwal, A., Arora, C.: Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 28285–28295 (June 2024)
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Int. Conf. Comput. Vis. pp. 12179–12188 (2021)
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1623–1637 (2020)
Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: Int. Conf. Comput. Vis. (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10684–10695 (June 2022)
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH. pp. 1–10 (2022)
Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042 (2023)
Saxena, A., Sun, M., Ng, A.Y.: Make3d: Learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 824–840 (2008)
Saxena, S., Herrmann, C., Hur, J., Kar, A., Norouzi, M., Sun, D., Fleet, D.J.: The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Adv. Neural Inform. Process. Syst. vol. 36, pp. 39443–39469. Curran Associates, Inc. (2023)
Saxena, S., Hur, J., Herrmann, C., Sun, D., Fleet, D.J.: Zero-shot metric depth with a field-of-view conditioned diffusion model. arXiv preprint arXiv:2312.13252 (2023)
Saxena, S., Kar, A., Norouzi, M., Fleet, D.J.: Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816 (2023)
Schilling, H., Gutsche, M., Brock, A., Spath, D., Rother, C., Krispin, K.: Mind the gap-a benchmark for dense depth prediction beyond lidar. In: IEEE Conf. Comput. Vis. Pattern Recog. Worksh. pp. 338–339 (2020)
Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3260–3269 (2017)
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M.: Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inform. Process. Syst. 35, 25278–25294 (2022)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: Eur. Conf. Comput. Vis. pp. 746–760. Springer (2012)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: Int. Conf. Mach. Learn. pp. 2256–2265. PMLR (2015)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: Int. Conf. Learn. Represent. (2021)
Tang, L., Jia, M., Wang, Q., Phoo, C.P., Hariharan, B.: Emergent correspondence from image diffusion. Adv. Neural Inform. Process. Syst. 36, 1363–1389 (2023)
Wan, Q., Huang, Z., Kang, B., Feng, J., Zhang, L.: Harnessing diffusion models for visual perception with meta prompts. arXiv preprint arXiv:2312.14733 (2023)
Wang, T., Zhang, T., Zhang, B., Ouyang, H., Chen, D., Chen, Q., Wen, F.: Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952 (2022)
Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al.: Internimage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778 (2022)
Wu, W., Zhao, Y., Chen, H., Gu, Y., Zhao, R., He, Y., Zhou, H., Shou, M.Z., Shen, C.: Datasetdm: Synthesizing data with perception annotations using diffusion models. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Adv. Neural Inform. Process. Syst. vol. 36, pp. 54683–54695. Curran Associates, Inc. (2023)
Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., Luo, Z.: Monocular relative depth perception with web stereo data supervision. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 311–320 (2018)
Xu, X., Zhao, H., Vineet, V., Lim, S.N., Torralba, A.: Mtformer: Multi-task learning via transformer and cross-task reasoning. In: Eur. Conf. Comput. Vis. pp. 304–321. Springer (2022)
Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10371–10381 (2024)
Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6613–6623 (2024)
Yin, W., Liu, Y., Shen, C., Yan, Y.: Enforcing geometric constraints of virtual normal for depth prediction. In: Int. Conf. Comput. Vis. pp. 5684–5693 (2019)
Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., Chen, X., Shen, C.: Metric3d: Towards zero-shot metric 3d prediction from a single image. In: Int. Conf. Comput. Vis. pp. 9043–9053 (2023)
Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., Shen, C.: Learning to recover 3d scene shape from a single image. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 204–213 (2021)
Zavadski, D., Feiden, J.F., Rother, C.: Controlnet-xs: Designing an efficient and effective architecture for controlling text-to-image diffusion models. arXiv preprint arXiv:2312.06573 (2023)
Zhang, C., Yin, W., Wang, B., Yu, G., Fu, B., Shen, C.: Hierarchical normalization for robust monocular depth estimation. Adv. Neural Inform. Process. Syst. 35, 14128–14139 (2022)
Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902 (2022)
Zhao, W., Rao, Y., Liu, Z., Liu, B., Zhou, J., Lu, J.: Unleashing text-to-image diffusion models for visual perception. In: Int. Conf. Comput. Vis. pp. 5729–5739 (October 2023)
Zheng, W., Teng, J., Yang, Z., Wang, W., Chen, J., Gu, X., Dong, Y., Ding, M., Tang, J.: Cogview3: Finer and faster text-to-image generation via relay diffusion. arXiv preprint arXiv:2403.05121 (2024)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 633–641 (2017)
Acknowledgements
We thank Yannick Pauler, Nicolas Bender and Friedrich Feiden for their help. The project has been supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) funded by the German Academic Exchange Service (DAAD). The authors gratefully acknowledge the support by the Ministry of Science, Research and the Arts Baden-Württemberg (MWK) through bwHPC, SDS@hd and the German Research Foundation (DFG) through the grants INST 35/1597-1 FUGG and INST 35/1503-1 FUGG.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zavadski, D., Kalšan, D., Rother, C. (2025). PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15476. Springer, Singapore. https://doi.org/10.1007/978-981-96-0917-8_2
DOI: https://doi.org/10.1007/978-981-96-0917-8_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0916-1
Online ISBN: 978-981-96-0917-8