Abstract
Semantic-driven 3D shape generation aims to generate 3D shapes conditioned on textual input. However, previous approaches are limited to single-category generation, produce only low-frequency geometric detail, and require large quantities of paired text-shape data. To address these issues, we propose a multi-category diffusion model. Specifically, our approach comprises the following components: 1) To mitigate the scarcity of large-scale paired data, we connect text, 2D images, and 3D shapes through the pre-trained CLIP model, enabling zero-shot learning. 2) To obtain multi-category 3D shape features, we employ a conditional flow model that generates a multi-category shape vector conditioned on the CLIP embedding. 3) To generate multi-category 3D shapes, we apply a hidden-layer diffusion model conditioned on the multi-category shape vector, which greatly reduces training time and memory consumption. We evaluate the generated results of our framework and show that our method outperforms existing approaches. Code and additional qualitative samples are available on the project website.
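To make the pipeline concrete, the sketch below wires the three stages together in PyTorch. It is a minimal illustration under stated assumptions: the module names (CondAffineCoupling, LatentDenoiser, generate), the layer sizes, the single coupling layer standing in for a full flow, and the linear beta schedule are all ours, not the paper's; only the overall data flow (CLIP text embedding → conditional flow → shape vector → latent diffusion) follows the abstract.

```python
# Minimal, self-contained sketch of the pipeline described in the abstract:
# CLIP text embedding -> conditional flow -> latent ("hidden-layer") diffusion.
# All names, dimensions, and schedules are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn

D_CLIP, D_SHAPE, D_LATENT = 512, 256, 128  # assumed embedding sizes


class CondAffineCoupling(nn.Module):
    """One RealNVP-style affine coupling layer conditioned on a CLIP embedding.

    Running the flow forward maps Gaussian noise to a multi-category
    shape vector (stage 2 of the pipeline).
    """

    def __init__(self, dim=D_SHAPE, cond=D_CLIP):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + cond, 256), nn.ReLU(),
            nn.Linear(256, dim),  # predicts scale and shift for the second half
        )

    def forward(self, z, c):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(torch.cat([z1, c], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z1, z2 * torch.exp(s) + t], dim=-1)


class LatentDenoiser(nn.Module):
    """Noise predictor for a DDPM in latent space, conditioned on the shape vector."""

    def __init__(self, dim=D_LATENT, cond=D_SHAPE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond + 1, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        t_feat = torch.full((x_t.shape[0], 1), float(t))  # crude timestep feature
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))


@torch.no_grad()
def generate(text_emb, flow, denoiser, steps=50):
    """Stages 2-3: sample the flow, then run DDPM ancestral sampling in latent space."""
    shape_vec = flow(torch.randn(text_emb.shape[0], D_SHAPE), text_emb)

    betas = torch.linspace(1e-4, 0.02, steps)  # linear beta schedule (assumed)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(text_emb.shape[0], D_LATENT)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, shape_vec)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x  # latent shape code; a pretrained decoder would turn it into geometry


# Usage: a random vector stands in for a real CLIP text embedding (stage 1).
text_emb = torch.randn(1, D_CLIP)
latent = generate(text_emb, CondAffineCoupling(), LatentDenoiser())
print(latent.shape)  # torch.Size([1, 128])
```

A real system would stack several coupling layers rather than one, obtain text_emb from the actual CLIP text encoder, and decode the sampled latent with a pretrained shape autoencoder.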
Cite this paper
Han, B., Shen, Y., Fu, Y. (2024). Zero3D: Semantic-Driven 3D Shape Generation for Zero-Shot Learning. In: Sheng, B., Bi, L., Kim, J., Magnenat-Thalmann, N., Thalmann, D. (eds.) Advances in Computer Graphics. CGI 2023. Lecture Notes in Computer Science, vol. 14496. Springer, Cham. https://doi.org/10.1007/978-3-031-50072-5_33