Abstract
Semantic-driven 3D shape generation aims to generate 3D shapes conditioned on textual input. However, previous approaches are limited to single-category generation, produce only low-frequency geometric detail, and require large quantities of paired text-shape data. To address these issues, we propose a multi-category diffusion model. Specifically, our approach comprises the following components: 1) To mitigate the scarcity of large-scale paired data, we connect text, 2D images, and 3D shapes through the pre-trained CLIP model, enabling zero-shot learning. 2) To obtain multi-category 3D shape features, we employ a conditional flow model that generates a multi-category shape vector conditioned on the CLIP embedding. 3) To generate multi-category 3D shapes, we apply a hidden-layer diffusion model conditioned on the multi-category shape vector, which greatly reduces training time and memory consumption. We evaluate the generated results of our framework and show that our method outperforms existing approaches. Code and additional qualitative samples are available on the project website.
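To make the pipeline concrete, the sketch below wires the three stages together in PyTorch. It is a minimal illustration under stated assumptions: the module names (CondAffineCoupling, LatentDenoiser, generate), the layer sizes, the single coupling layer standing in for a full flow, and the linear beta schedule are all ours, not the paper's; only the overall data flow (CLIP text embedding → conditional flow → shape vector → latent diffusion) follows the abstract.

```python
# Minimal, self-contained sketch of the pipeline described in the abstract:
# CLIP text embedding -> conditional flow -> latent ("hidden-layer") diffusion.
# All names, dimensions, and schedules are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn

D_CLIP, D_SHAPE, D_LATENT = 512, 256, 128  # assumed embedding sizes


class CondAffineCoupling(nn.Module):
    """One RealNVP-style affine coupling layer conditioned on a CLIP embedding.

    Running the flow forward maps Gaussian noise to a multi-category
    shape vector (stage 2 of the pipeline).
    """

    def __init__(self, dim=D_SHAPE, cond=D_CLIP):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + cond, 256), nn.ReLU(),
            nn.Linear(256, dim),  # predicts scale and shift for the second half
        )

    def forward(self, z, c):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(torch.cat([z1, c], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z1, z2 * torch.exp(s) + t], dim=-1)


class LatentDenoiser(nn.Module):
    """Noise predictor for a DDPM in latent space, conditioned on the shape vector."""

    def __init__(self, dim=D_LATENT, cond=D_SHAPE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond + 1, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        t_feat = torch.full((x_t.shape[0], 1), float(t))  # crude timestep feature
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))


@torch.no_grad()
def generate(text_emb, flow, denoiser, steps=50):
    """Stages 2-3: sample the flow, then run DDPM ancestral sampling in latent space."""
    shape_vec = flow(torch.randn(text_emb.shape[0], D_SHAPE), text_emb)

    betas = torch.linspace(1e-4, 0.02, steps)  # linear beta schedule (assumed)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(text_emb.shape[0], D_LATENT)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, shape_vec)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x  # latent shape code; a pretrained decoder would turn it into geometry


# Usage: a random vector stands in for a real CLIP text embedding (stage 1).
text_emb = torch.randn(1, D_CLIP)
latent = generate(text_emb, CondAffineCoupling(), LatentDenoiser())
print(latent.shape)  # torch.Size([1, 128])
```

A real system would stack several coupling layers rather than one, obtain text_emb from the actual CLIP text encoder, and decode the sampled latent with a pretrained shape autoencoder.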
Cite this paper
Han, B., Shen, Y., Fu, Y. (2024). Zero3D: Semantic-Driven 3D Shape Generation for Zero-Shot Learning. In: Sheng, B., Bi, L., Kim, J., Magnenat-Thalmann, N., Thalmann, D. (eds.) Advances in Computer Graphics. CGI 2023. Lecture Notes in Computer Science, vol. 14496. Springer, Cham. https://doi.org/10.1007/978-3-031-50072-5_33