Abstract
Diffusion models have recently been applied to point cloud generation with considerable success. The central idea is a forward process that progressively adds noise to point clouds and a learned reverse process that generates point clouds by gradually removing that noise. However, because point cloud data is high-dimensional and exhibits complex structures, adequately capturing the surface distribution of point clouds is challenging. Moreover, point cloud generation methods often resort to sampling and local operations to extract features, which inevitably ignores the global structure and overall shape of point clouds. To address these limitations, we propose a Transformer-based latent diffusion model for point cloud generation. Instead of building the diffusion process directly on the points, we first introduce a latent compressor that converts original point clouds into a set of latent tokens before feeding them into the diffusion model. Representing point clouds as latent tokens not only improves expressiveness but also offers greater flexibility, since the tokens can adapt to various downstream tasks. We carefully design the latent compressor as an attention-based auto-encoder to capture the global structure of point clouds, and we adopt transformers as the backbone of the latent diffusion module to preserve that structure. The strong feature extraction ability of transformers ensures the high quality and smoothness of the generated point clouds. Experiments show that our method achieves superior performance in both unconditional generation on ShapeNet and multi-modal point cloud completion on ShapeNet-ViPC. Our code and samples are publicly available at https://github.com/Negai-98/LDT.
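To make the two-stage pipeline in the abstract concrete, the sketch below shows one plausible shape it could take: an attention-based latent compressor that maps a point cloud to a small set of latent tokens, and a transformer denoiser trained with the standard DDPM noise-prediction objective on those tokens. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (LatentCompressor, LatentDenoiser, ddpm_loss), the token count, feature width, and the linear beta schedule are all illustrative choices of ours.

```python
# Minimal sketch of a latent-token diffusion pipeline for point clouds.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentCompressor(nn.Module):
    """Encode N points into K latent tokens via cross-attention from
    learnable queries (Set-Transformer/Perceiver style), and decode
    the tokens back to point coordinates."""

    def __init__(self, n_points=2048, n_tokens=32, dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, dim))    # latent queries
        self.point_embed = nn.Linear(3, dim)
        self.encode_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decode_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_queries = nn.Parameter(torch.randn(n_points, dim))  # per-point queries
        self.to_xyz = nn.Linear(dim, 3)

    def encode(self, points):                        # points: (B, N, 3)
        feats = self.point_embed(points)             # (B, N, dim)
        q = self.queries.expand(points.size(0), -1, -1)
        tokens, _ = self.encode_attn(q, feats, feats)  # (B, K, dim)
        return tokens

    def decode(self, tokens):                        # tokens: (B, K, dim)
        q = self.out_queries.expand(tokens.size(0), -1, -1)
        feats, _ = self.decode_attn(q, tokens, tokens)
        return self.to_xyz(feats)                    # (B, N, 3)


class LatentDenoiser(nn.Module):
    """Transformer backbone that predicts the noise added to the tokens."""

    def __init__(self, dim=256, depth=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z_t, t):                       # z_t: (B, K, dim), t: (B,)
        temb = self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        return self.blocks(z_t + temb)               # predicted noise, (B, K, dim)


def ddpm_loss(denoiser, z0, alpha_bars):
    """Sample t, corrupt z0 with the closed-form forward process
    q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I),
    and regress the injected noise."""
    B = z0.size(0)
    t = torch.randint(0, len(alpha_bars), (B,), device=z0.device)
    abar = alpha_bars[t].view(B, 1, 1)
    eps = torch.randn_like(z0)
    z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * eps
    return F.mse_loss(denoiser(z_t, t), eps)


# Usage on random data, assuming a linear beta schedule with 1000 steps.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)
compressor, denoiser = LatentCompressor(), LatentDenoiser()
points = torch.randn(4, 2048, 3)
z0 = compressor.encode(points)
loss = ddpm_loss(denoiser, z0, alpha_bars) + F.mse_loss(compressor.decode(z0), points)
loss.backward()
```

Note that latent diffusion pipelines typically train the auto-encoder first and freeze it before training the denoiser; the joint loss above is collapsed into one step only for brevity.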
Data availability statement
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the Beijing Natural Science Foundation (Grant No. 4222020), and in part by the National Natural Science Foundation of China (Grant No. 62241601).
Author information
Contributions
JJ contributed to methodology and conceptualization; RZ contributed to methodology, draft writing, results analysis, and visualization; ML contributed to methodology, draft writing, and results analysis.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent for data used
Not applicable. The current study does not involve human participants or animals.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ji, J., Zhao, R. & Lei, M. Latent diffusion transformer for point cloud generation. Vis Comput 40, 3903–3917 (2024). https://doi.org/10.1007/s00371-024-03396-1