MVDD: Multi-view Depth Diffusion Models

Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15071)

Abstract

Denoising diffusion models have demonstrated outstanding results in 2D image generation, yet replicating this success in 3D shape generation remains a challenge. In this paper, we propose leveraging multi-view depth, which represents complex 3D shapes in a 2D data format that is easy to denoise. We pair this representation with a diffusion model, MVDD, that is capable of generating high-quality dense point clouds with 20K+ points and fine-grained details. To enforce 3D consistency across multi-view depth maps, we introduce an epipolar line segment attention that conditions the denoising step for a view on its neighboring views. Additionally, a depth fusion module is incorporated into the diffusion steps to further ensure the alignment of the depth maps. When augmented with surface reconstruction, MVDD can also produce high-quality 3D meshes. Furthermore, MVDD stands out in other tasks such as depth completion and can serve as a 3D prior, significantly boosting downstream tasks such as GAN inversion. State-of-the-art results from extensive experiments demonstrate MVDD’s excellent ability in 3D shape generation and depth completion, and its potential as a 3D prior for downstream tasks.
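
As a rough illustration of the mechanism the abstract describes, the following PyTorch sketch pairs a cross-view attention, in which each pixel of the view being denoised attends only to features sampled along an epipolar line segment in a neighboring view, with a standard DDPM reverse step applied to the stacked multi-view depth maps. This is a hypothetical reading, not the authors' implementation: the module names, tensor shapes, the number of epipolar samples S, and the plain DDPM update are all assumptions, and the depth fusion module is not shown.

```python
import torch
import torch.nn as nn

class EpipolarSegmentAttention(nn.Module):
    """Hypothetical sketch: each pixel of the view being denoised attends
    only to S features sampled along its epipolar line segment in a
    neighboring view (how the segment is sampled is assumed, not shown)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, epi_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (B, N, C)    one feature per pixel of the current view
        # epi_feats:   (B, N, S, C) S samples along each pixel's epipolar segment
        B, N, S, C = epi_feats.shape
        q = query_feats.reshape(B * N, 1, C)   # one query per pixel
        kv = epi_feats.reshape(B * N, S, C)    # keys/values: its epipolar samples
        out, _ = self.attn(q, kv, kv)          # attention restricted to the segment
        return out.reshape(B, N, C)

@torch.no_grad()
def ddpm_reverse_step(model, x_t, t, betas, alphas_cumprod):
    """One standard DDPM reverse step (Ho et al., 2020) on a stack of
    multi-view depth maps x_t of shape (B, V, 1, H, W). `model` jointly
    predicts the noise for all V views, so cross-view conditioning
    (e.g. the attention above) happens inside it."""
    eps = model(x_t, t)                               # predicted noise, same shape as x_t
    beta_t, a_bar_t = betas[t], alphas_cumprod[t]
    mean = (x_t - beta_t / (1 - a_bar_t).sqrt() * eps) / (1 - beta_t).sqrt()
    if t > 0:                                         # add noise except at the final step
        mean = mean + beta_t.sqrt() * torch.randn_like(x_t)
    return mean
```

A full sampler would iterate t = T-1, …, 0 and, per the abstract, interleave a depth fusion step that reconciles the V depth maps between denoising iterations; back-projecting the final maps yields the dense point cloud.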

Z. Wang—Work done while the author was an intern at Google.

Acknowledgement

A.K. was supported by National Science Foundation (NSF) CAREER award IIS-2046737, an Army Young Investigator Program Award, and a Defense Advanced Research Projects Agency (DARPA) Young Faculty Award.

Author information

Corresponding author

Correspondence to Zhen Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 8994 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Z. et al. (2025). MVDD: Multi-view Depth Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_14

  • DOI: https://doi.org/10.1007/978-3-031-72624-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72623-1

  • Online ISBN: 978-3-031-72624-8

  • eBook Packages: Computer Science, Computer Science (R0)
