
Latent diffusion transformer for point cloud generation

  • Research
  • Published in The Visual Computer

Abstract

Diffusion models have recently been applied to point cloud generation with considerable success. The core idea is a forward process that progressively adds noise to point clouds, paired with a reverse process that generates point clouds by denoising. However, because point cloud data is high-dimensional and exhibits complex structures, it is challenging to adequately capture the surface distribution of point clouds. Moreover, point cloud generation methods often resort to sampling strategies and local operations to extract features, which inevitably ignores the global structures and overall shapes of point clouds. To address these limitations, we propose a latent diffusion model based on transformers for point cloud generation. Instead of building a diffusion process directly on the points, we first introduce a latent compressor that converts the original point clouds into a set of latent tokens before feeding them into the diffusion model. Representing point clouds as latent tokens not only improves expressiveness but also offers greater flexibility, since the tokens can adapt to various downstream tasks. We carefully design the latent compressor as an attention-based auto-encoder that captures the global structures of point clouds. We then use transformers as the backbone of the latent diffusion module to preserve these global structures. The strong feature extraction ability of transformers ensures the high quality and smoothness of the generated point clouds. Experiments show that our method achieves superior performance in both unconditional generation on ShapeNet and multi-modal point cloud completion on ShapeNet-ViPC. Our code and samples are publicly available at https://github.com/Negai-98/LDT.
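
As a concrete illustration of the pipeline the abstract describes (compress a point cloud into latent tokens with an attention-based auto-encoder, then run a transformer-backed diffusion process on those tokens), here is a minimal PyTorch sketch. It assumes the standard denoising-diffusion (DDPM) forward process with closed form z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps; all module names, dimensions, and the toy training step are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Assumption: a standard DDPM-style diffusion in latent space; all module
# names, sizes, and the training step below are illustrative.
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Attention-based auto-encoder (encoder half): maps a point cloud
    (B, N, 3) to a small set of latent tokens (B, M, D). The learned
    queries attend over all points, so each token sees the global shape."""
    def __init__(self, num_tokens=16, dim=128):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.point_embed = nn.Linear(3, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, points):                              # (B, N, 3)
        kv = self.point_embed(points)                       # (B, N, D)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                    # cross-attention
        return tokens                                       # (B, M, D)

class LatentDenoiser(nn.Module):
    """Transformer backbone that predicts the noise added to latent tokens."""
    def __init__(self, dim=128, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z_t, t):                              # (B, M, D), (B,)
        h = z_t + self.time_embed(t.float().view(-1, 1)).unsqueeze(1)
        return self.blocks(h)                               # predicted noise

# Closed-form forward process: z_t = sqrt(ab_t) * z_0 + sqrt(1 - ab_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def noise_latents(z0, t):
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, 1, 1)
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps, eps

# Toy training step: compress points, noise the latents, regress the noise.
compressor, denoiser = LatentCompressor(), LatentDenoiser()
points = torch.randn(2, 1024, 3)                            # dummy batch
z0 = compressor(points)
t = torch.randint(0, T, (2,))
z_t, eps = noise_latents(z0, t)
loss = nn.functional.mse_loss(denoiser(z_t, t), eps)
loss.backward()
```

At sampling time one would start from Gaussian latent tokens, iteratively denoise them with the transformer, and then decode back to points; the decoder half of the auto-encoder is omitted here for brevity.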


Data availability statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the Beijing Natural Science Foundation (Grant No. 4222020), and in part by the National Natural Science Foundation of China (Grant No. 62241601).

Author information


Contributions

JJ contributed to methodology and conceptualization; RZ contributed to methodology, draft writing, results analysis, and visualization; ML contributed to methodology, draft writing, and results analysis.

Corresponding author

Correspondence to Minglong Lei.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

Not applicable. The current study does not involve humans or animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ji, J., Zhao, R. & Lei, M. Latent diffusion transformer for point cloud generation. Vis Comput 40, 3903–3917 (2024). https://doi.org/10.1007/s00371-024-03396-1
