
Depth-Guided NeRF Training via Earth Mover’s Distance

  • Conference paper, Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15122)


Abstract

Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different geometries that yield the same image. Previous work has therefore incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice their accuracy is harder to characterize. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover’s Distance instead of enforcing the rendered depth to replicate the depth prior exactly through an \(L_2\) loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
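To make the central idea concrete, below is a minimal sketch, assuming a PyTorch NeRF pipeline, of what EMD supervision on the ray termination distribution could look like. It is not the authors' released code: the function name emd_depth_loss, and the choice of a discretized Gaussian target centered at the diffusion depth prior with a width set by its estimated uncertainty sigma_prior, are illustrative assumptions. The sketch relies on the fact that, for one-dimensional distributions on a shared sorted support, the Wasserstein-1 (Earth Mover's) distance equals the integral of the absolute difference between the two CDFs.

    import torch

    def emd_depth_loss(weights, t_vals, depth_prior, sigma_prior, eps=1e-8):
        # weights:     (R, S) volume-rendering weights per ray sample
        # t_vals:      (R, S) sorted sample distances along each ray
        # depth_prior: (R,)   depth predicted by the diffusion prior
        # sigma_prior: (R,)   per-ray uncertainty of that prediction

        # Normalize the rendering weights into a termination-distance
        # distribution p over the samples of each ray.
        p = weights / (weights.sum(dim=-1, keepdim=True) + eps)

        # Discretize a Gaussian target q around the prior onto the same
        # samples; a larger sigma_prior spreads the target, softening
        # supervision where the prior is uncertain (an assumed design choice).
        z = (t_vals - depth_prior[:, None]) / sigma_prior[:, None]
        q = torch.exp(-0.5 * z ** 2)
        q = q / (q.sum(dim=-1, keepdim=True) + eps)

        # In 1-D, the Wasserstein-1 distance is the integral over the
        # support of the absolute difference between the two CDFs.
        cdf_gap = (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs()
        bin_widths = torch.diff(t_vals, dim=-1)          # (R, S-1)
        w1 = (cdf_gap[..., :-1] * bin_widths).sum(dim=-1)
        return w1.mean()

Unlike an \(L_2\) penalty on the expected depth, this loss charges for how far probability mass must move, so a slightly misplaced prior increases the loss gradually rather than forcing the rendered depth to collapse onto a possibly wrong value.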


Notes

  1. Both [10, 21] are called DDP, so we refer to them as DiffDP and DDPrior.

References

  1. Adamkiewicz, M., et al.: Vision-only robot navigation in a neural radiance world. IEEE Rob. Autom. Lett. 7(2), 4606–4613 (2022)

  2. Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: ZoeDepth: zero-shot transfer by combining relative and metric depth (2023). https://doi.org/10.48550/ARXIV.2302.12288

  3. Blukis, V., et al.: One-shot neural fields for 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) on XRNeRF: Advances in NeRF for the Metaverse (2023)

  4. Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572 (2013)

  5. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)

  6. Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: fewer views and faster training for free. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12882–12891 (2022)

  7. Feydy, J., Séjourné, T., Vialard, F.X., Amari, S.I., Trouvé, A., Peyré, G.: Interpolating between optimal transport and MMD using Sinkhorn divergences. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690. PMLR (2019)

  8. Wang, G., Chen, Z., Loy, C.C., Liu, Z.: SparseNeRF: distilling depth ranking for few-shot novel view synthesis. In: IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  9. Jain, A., Tancik, M., Abbeel, P.: Putting NeRF on a diet: semantically consistent few-shot view synthesis. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5865–5874 (2021). https://doi.org/10.1109/ICCV48922.2021.00583

  10. Ji, Y., et al.: DDP: diffusion model for dense visual prediction (2023)

  11. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? (2017)

  12. Li, Z., Chen, Z., Liu, X., Jiang, J.: DepthFormer: exploiting long-range correlation and local information for accurate monocular depth estimation. Mach. Intell. Res. 20(6), 837–854 (2023)

  13. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)

  14. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24

  15. Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S.M., Geiger, A., Radwan, N.: RegNeRF: regularizing neural radiance fields for view synthesis from sparse inputs. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  16. Oquab, M., et al.: DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  17. Prinzler, M., Hilliges, O., Thies, J.: DINER: depth-aware image-based neural radiance fields. In: Computer Vision and Pattern Recognition (CVPR) (2023)

  18. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  19. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)

  20. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1623–1637 (2020)

  21. Roessle, B., Barron, J.T., Mildenhall, B., Srinivasan, P.P., Nießner, M.: Dense depth priors for neural radiance fields from sparse input views. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12882–12891 (2022). https://doi.org/10.1109/CVPR52688.2022.01255

  22. Saxena, S., Kar, A., Norouzi, M., Fleet, D.J.: Monocular depth estimation using diffusion models (2023)

  23. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  24. Song, J., et al.: DäRF: boosting radiance fields from sparse inputs with monocular depth adaptation. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 68458–68470 (2023)

  25. Uy, M.A., Martin-Brualla, R., Guibas, L., Li, K.: SCADE: NeRFs from space carving with ambiguity-aware depth estimates. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  26. Wang, Y., Long, Y., Fan, S.H., Dou, Q.: Neural rendering for stereo 3D reconstruction of deformable tissues in robotic surgery. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) MICCAI 2022. LNCS, vol. 13437, pp. 431–441. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16449-1_41

  27. Wei, Y., Liu, S., Rao, Y., Zhao, W., Lu, J., Zhou, J.: NerfingMVS: guided optimization of neural radiance fields for indoor multi-view stereo. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5590–5599 (2021). https://doi.org/10.1109/ICCV48922.2021.00556

  28. Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., Cao, Y.: Revealing the dark secrets of masked image modeling. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14475–14485 (2023). https://doi.org/10.1109/CVPR52729.2023.01391

  29. Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10371–10381 (2024)

  30. Yin, W., et al.: Metric3D: towards zero-shot metric 3D prediction from a single image (2023)

  31. Yin, W., et al.: Learning to recover 3D scene shape from a single image. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 204–213 (2021). https://doi.org/10.1109/CVPR46437.2021.00027

  32. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). http://arxiv.org/abs/2012.02190v3

  33. Yuan, W., Gu, X., Dai, Z., Zhu, S., Tan, P.: Neural window fully-connected CRFs for monocular depth estimation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3906–3915 (2022). https://doi.org/10.1109/CVPR52688.2022.00389

  34. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)

  35. Zhu, Z., et al.: NICE-SLAM: neural implicit scalable encoding for SLAM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12786–12796 (2022)


Acknowledgements

Thanks to the anonymous reviewers for their constructive feedback. This work was supported by the Isackson Family Foundation, the Stanford Head and Neck Surgery Research Fund, and the Stanford Graduate Fellowship.

Author information


Corresponding author

Correspondence to Anita Rau.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7042 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Rau, A., Aklilu, J., Christopher Holsinger, F., Yeung-Levy, S. (2025). Depth-Guided NeRF Training via Earth Mover’s Distance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73039-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73038-2

  • Online ISBN: 978-3-031-73039-9

  • eBook Packages: Computer Science, Computer Science (R0)
