Abstract
Denoising diffusion models have demonstrated outstanding results in 2D image generation, yet replicating this success in 3D shape generation remains a challenge. In this paper, we propose leveraging multi-view depth, which represents complex 3D shapes in a 2D data format that is easy to denoise. We pair this representation with a diffusion model, MVDD, that is capable of generating high-quality dense point clouds with 20K+ points and fine-grained details. To enforce 3D consistency across multi-view depth maps, we introduce an epipolar line segment attention that conditions the denoising step for a view on its neighboring views. Additionally, a depth fusion module is incorporated into the diffusion steps to further ensure the alignment of the depth maps. When augmented with surface reconstruction, MVDD can also produce high-quality 3D meshes. Furthermore, MVDD stands out in other tasks such as depth completion and can serve as a 3D prior that significantly boosts downstream tasks such as GAN inversion. State-of-the-art results from extensive experiments demonstrate MVDD's strong ability in 3D shape generation and depth completion, as well as its potential as a 3D prior for downstream tasks.
Z. Wang—Work done while the author was an intern at Google.
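The core idea of the multi-view depth representation can be illustrated with a short back-projection sketch. The snippet below is a minimal, illustrative example, not the authors' implementation: it assumes pinhole intrinsics `K`, camera-to-world extrinsics `T`, and depth measured along the camera z-axis, and simply lifts each view's valid depth pixels into a shared world frame, yielding the kind of dense point cloud that MVDD's depth fusion step would further align.

```python
# Minimal sketch (not the authors' code): back-project per-view depth maps
# into one fused point cloud. K, T, and the camera conventions are assumptions.
import numpy as np

def backproject(depth, K, T):
    """Lift an HxW depth map to world-space 3D points.

    depth: (H, W) depths along the camera z-axis.
    K:     (3, 3) pinhole intrinsics.
    T:     (4, 4) camera-to-world extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous pixels
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T        # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)                 # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_h @ T.T)[:, :3]                           # map to world frame

def fuse_views(depths, Ks, Ts, valid_min=1e-3):
    """Union of back-projected points over all views, skipping empty pixels."""
    clouds = []
    for d, K, T in zip(depths, Ks, Ts):
        mask = d.reshape(-1) > valid_min                  # drop background
        clouds.append(backproject(d, K, T)[mask])
    return np.concatenate(clouds, axis=0)
```

For a sense of scale (the view count and resolution here are illustrative, not taken from the paper), eight 128x128 depth maps yield up to roughly 131K candidate points before masking, which makes dense outputs of 20K+ points straightforward to obtain from this 2D representation.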
Acknowledgement
A.K. was supported by a National Science Foundation (NSF) CAREER Award (IIS-2046737), an Army Young Investigator Program Award, and a Defense Advanced Research Projects Agency (DARPA) Young Faculty Award.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, Z. et al. (2025). MVDD: Multi-view Depth Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_14
DOI: https://doi.org/10.1007/978-3-031-72624-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8