ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder that extends ReCon [101] to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.

R. Dong—Project lead.

Work done during Z. Qi and R. Dong’s internships at MEGVII & IIISCT.
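
The abstract describes a simple high-level pipeline: a point cloud encoder (ReCon++) produces 3D tokens that are projected into the LLM's embedding space and consumed together with the text prompt. The sketch below illustrates that data flow only; it is not the authors' released implementation, and every module name, token count, and dimension in it is a hypothetical placeholder.

```python
# Illustrative sketch only -- NOT the authors' released code. It assumes a
# generic "encode point cloud, project to LLM embedding space, prepend to the
# text prompt" design; all names and sizes are hypothetical.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Stand-in for a ReCon++-style point cloud encoder (hypothetical)."""

    def __init__(self, out_dim: int = 512, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3); assumes N is divisible by num_tokens.
        feats = self.mlp(points)                                 # (B, N, out_dim)
        B, N, D = feats.shape
        groups = feats.view(B, self.num_tokens, N // self.num_tokens, D)
        return groups.max(dim=2).values                          # (B, num_tokens, out_dim)


class ShapeLLMSketch(nn.Module):
    """Minimal 3D-encoder + projector + (placeholder) LLM input pipeline."""

    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.encoder = PointCloudEncoder(out_dim=512)
        self.projector = nn.Linear(512, llm_dim)                 # maps 3D tokens into LLM space

    def forward(self, points: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        point_tokens = self.projector(self.encoder(points))      # (B, 32, llm_dim)
        # Prepend the projected 3D tokens to the text prompt embeddings,
        # which a decoder-only LLM would then process jointly.
        return torch.cat([point_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    model = ShapeLLMSketch()
    pts = torch.randn(1, 1024, 3)        # a toy point cloud
    txt = torch.randn(1, 16, 4096)       # toy prompt embeddings
    print(model(pts, txt).shape)         # torch.Size([1, 48, 4096])
```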


Notes

  1. URL & License.

  2. “Likes” statistics can be found at Sketchfab.

References

  1. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.J.: Learning representations and generative models for 3D point clouds. In: International Conference on Machine Learning (ICML) (2018)

  2. Alayrac, J., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  3. Bai, Y., et al.: Sequential modeling enables scalable learning for large vision models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  4. Betker, J., et al.: Improving image generation with better captions (2023)

  5. Bommasani, R., et al.: On the opportunities and risks of foundation models. CoRR abs/2108.07258 (2021)

  6. Bradski, G., Grossberg, S.: Recognition of 3-D objects from multiple 2-D views by a self-organizing neural architecture. In: Cherkassky, V., Friedman, J.H., Wechsler, H. (eds.) NATO ASI Series, vol. 136, pp. 349–375. Springer, Heidelberg (1994). https://doi.org/10.1007/978-3-642-79119-2_17

  7. Bronstein, A.M., Bronstein, M.M., Guibas, L.J., Ovsjanikov, M.: Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graph. 30(1), 1:1–1:20 (2011)

  8. Brown, T.B., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  9. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  10. Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. CoRR abs/1512.03012 (2015)

  11. Chang, M., et al.: GOAT: GO to any thing. In: Robotics: Science and Systems (RSS) (2024)

  12. Chen, B., et al.: SpatialVLM: endowing vision-language models with spatial reasoning capabilities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  13. Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 202–221. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_13

  14. Chen, D.Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2Cap: context-aware dense captioning in RGB-D scans. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  15. Chen, G., Wang, M., Yang, Y., Yu, K., Yuan, L., Yue, Y.: PointGPT: auto-regressively generative pre-training from point clouds. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  16. Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: unleashing multimodal LLM’s referential dialogue magic. CoRR abs/2306.15195 (2023)

  17. Chen, S., Garcia, R., Laptev, I., Schmid, C.: SUGAR: pre-training 3D visual representations for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18049–18060 (2024)

  18. Chen, X., et al.: PaLI-X: on scaling up a multilingual vision and language model. In: International Conference on Learning Representations (ICLR) (2023)

  19. Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. https://lmsys.org/blog/2023-03-30-vicuna/

  20. Collins, J., et al.: ABO: dataset and benchmarks for real-world 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  21. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  22. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  23. Das, A., et al.: Visual dialog. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 41(5), 1242–1256 (2019)

  24. Davison, J., Feldman, J., Rush, A.M.: Commonsense knowledge mining from pretrained models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)

  25. Deitke, M., et al.: Objaverse: a universe of annotated 3D objects. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  26. Ding, Y., Zhang, X., Paxton, C., Zhang, S.: Task and motion planning with large language models for object rearrangement. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)

  27. Dong, R., et al.: DreamLLM: synergistic multimodal comprehension and creation. In: International Conference on Learning Representations (ICLR) (2024)

  28. Dong, R., et al.: Autoencoders as cross-modal teachers: can pretrained 2D image transformers help 3D representation learning? In: International Conference on Learning Representations (ICLR) (2023)

  29. Driess, D., et al.: PaLM-E: an embodied multimodal language model. In: International Conference on Machine Learning (ICML) (2023)

  30. Fan, G., Qi, Z., Shi, W., Ma, K.: Point-GCC: universal self-supervised 3D scene pre-training via geometry-color contrast. CoRR abs/2305.19623 (2023)

  31. Fu, H., et al.: 3D-FUTURE: 3D furniture shape with texture. Int. J. Comput. Vision 129, 3313–3337 (2021)

  32. Gao, Y., Wang, Z., Zheng, W.S., Xie, C., Zhou, Y.: Sculpting holistic 3D representation in contrastive language-image-3D pre-training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  33. Ge, Y., Ge, Y., Zeng, Z., Wang, X., Shan, Y.: Planting a SEED of vision in large language model. In: International Conference on Learning Representations (ICLR) (2024)

  34. Geng, H., Li, Z., Geng, Y., Chen, J., Dong, H., Wang, H.: PartManip: learning cross-category generalizable part manipulation policy from point cloud observations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  35. Geng, H., Wei, S., Deng, C., Shen, B., Wang, H., Guibas, L.: SAGE: bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions. In: Robotics: Science and Systems (RSS) (2024)

  36. Geng, H., et al.: GAPartNet: cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  37. Geng, Y., An, B., Geng, H., Chen, Y., Yang, Y., Dong, H.: RLAfford: end-to-end affordance learning for robotic manipulation. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)

  38. Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)

  39. Gong, R., et al.: ARNOLD: a benchmark for language-grounded task learning with continuous states in realistic 3D scenes. In: International Conference on Computer Vision (ICCV) (2023)

  40. Grabner, H., Gall, J., Gool, L.V.: What makes a chair a chair? In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2011)

  41. Guo, Z., et al.: Point-bind & point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. CoRR abs/2309.00615 (2023)

  42. Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  43. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23

  44. Gupta, T., Kembhavi, A.: Visual programming: compositional visual reasoning without training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  45. Hamdi, A., Giancola, S., Ghanem, B.: MVTN: multi-view transformation network for 3D shape recognition. In: International Conference on Computer Vision (ICCV), pp. 1–11. IEEE (2021)

  46. Hong, Y., et al.: 3D-LLM: injecting the 3D world into large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  47. Hou, J., Xie, S., Graham, B., Dai, A., Nießner, M.: Pri3D: can 3D priors help 2D representation learning? In: International Conference on Computer Vision (ICCV), pp. 5673–5682. IEEE (2021)

  48. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)

  49. Hu, R., van Kaick, O., Wu, B., Huang, H., Shamir, A., Zhang, H.: Learning how objects function via co-analysis of interactions. ACM Trans. Graph. 35(4), 47:1–47:13 (2016)

  50. Hu, R., Li, W., van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 227:1–227:13 (2017)

  51. Hu, R., Zhu, C., van Kaick, O., Liu, L., Shamir, A., Zhang, H.: Interaction context (ICON): towards a geometric functionality descriptor. ACM Trans. Graph. 34(4), 83:1–83:12 (2015)

  52. Huang, J., et al.: An embodied generalist agent in 3D world. In: International Conference on Machine Learning (ICML) (2024)

  53. Huang, T., et al.: CLIP2Point: transfer CLIP to point cloud classification with image-depth pre-training. In: International Conference on Computer Vision (ICCV) (2023)

  54. Huang, W., Mordatch, I., Pathak, D.: One policy to control them all: shared modular policies for agent-agnostic control. In: International Conference on Machine Learning (ICML) (2020)

  55. Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: VoxPoser: composable 3D value maps for robotic manipulation with language models. In: Annual Conference on Robot Learning (CoRL) (2023)

  56. Huang, W., et al.: Inner monologue: embodied reasoning through planning with language models. In: Annual Conference on Robot Learning (CoRL) (2022)

  57. Ichter, B., et al.: Do as I can, not as I say: grounding language in robotic affordances. In: Annual Conference on Robot Learning (CoRL) (2022)

  58. Ilharco, G., et al.: OpenCLIP, July 2021

  59. Jia, M., et al.: Visual prompt tuning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13693, pp. 709–727. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_41

  60. Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know? Trans. Assoc. Comput. Linguistics 8, 423–438 (2020)

  61. Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)

  62. Kim, V.G., Chaudhuri, S., Guibas, L.J., Funkhouser, T.A.: Shape2Pose: human-centric shape analysis. ACM Trans. Graph. 33(4), 120:1–120:12 (2014)

  63. Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  64. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)

  65. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (ICML) (2023)

  66. Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)

  67. Li, X., Wang, H., Yi, L., Guibas, L.J., Abbott, A.L., Song, S.: Category-level articulated object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  68. Li, X., et al.: ManipLLM: embodied multimodal large language model for object-centric robotic manipulation (2023)

  69. Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305. Association for Computational Linguistics, Singapore (2023)

  70. Liang, Y., et al.: TaskMatrix.AI: completing tasks by connecting foundation models with millions of APIs. Intell. Comput. 3, 0063 (2024)

  71. Lin, K., Agia, C., Migimatsu, T., Pavone, M., Bohg, J.: Text2Motion: from natural language instructions to feasible plans. Auton. Robot. 47(8), 1345–1365 (2023)

  72. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  73. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  74. Liu, M., et al.: OpenShape: scaling up 3D shape representation towards open-world understanding. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  75. Liu, X., Wang, B., Wang, H., Yi, L.: Few-shot physically-aware articulated mesh generation via hierarchical deformation. In: International Conference on Computer Vision (ICCV) (2023)

  76. Liu, X., Yi, L.: GeneOH diffusion: towards generalizable hand-object interaction denoising via denoising diffusion. In: International Conference on Learning Representations (ICLR) (2024)

  77. Liu, X., Zhang, J., Hu, R., Huang, H., Wang, H., Yi, L.: Self-supervised category-level articulated object pose estimation with part-level SE(3) equivariance. In: International Conference on Learning Representations (ICLR) (2023)

  78. Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  79. Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? CoRR abs/2307.06281 (2023)

  80. Liu, Y., et al.: SyncDreamer: generating multiview-consistent images from a single-view image. In: International Conference on Learning Representations (ICLR) (2024)

  81. Liu, Y., Chen, J., Zhang, Z., Huang, J., Yi, L.: LeaF: learning frames for 4D point cloud sequence understanding. In: International Conference on Computer Vision (ICCV) (2023)

  82. Lu, C., et al.: Beyond holistic object recognition: enriching image understanding with part states. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  83. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3D captioning with pretrained models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  84. Ma, X., et al.: SQA3D: situated question answering in 3D scenes. In: International Conference on Learning Representations (ICLR) (2023)

  85. Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network design and local geometry in point cloud: a simple residual MLP framework. In: International Conference on Learning Representations (ICLR). OpenReview.net (2022)

  86. Mo, K., et al.: PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  87. Mu, Y., et al.: EmbodiedGPT: vision-language pre-training via embodied chain of thought. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  88. OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023). https://openai.com/research/gpt-4

  89. OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card

  90. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  91. Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-G: generating images in context with multimodal large language models. In: International Conference on Learning Representations (ICLR) (2024)

  92. Pang, Y., Wang, W., Tay, F.E.H., Liu, W., Tian, Y., Yuan, L.: Masked autoencoders for point cloud self-supervised learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13662, pp. 604–621. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20086-1_35

  93. Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with GPT-4. CoRR abs/2304.03277 (2023)

  94. Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.A.: OpenScene: 3D scene understanding with open vocabularies. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  95. Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. CoRR abs/2306.14824 (2023)

  96. Petroni, F., et al.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)

  97. Pirk, S., et al.: Understanding and exploiting object interaction landscapes. ACM Trans. Graph. 36(3), 31:1–31:14 (2017)

  98. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85 (2017)

  99. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5099–5108 (2017)

  100. Qi, H., Kumar, A., Calandra, R., Ma, Y., Malik, J.: In-hand object rotation via rapid motor adaptation. In: Annual Conference on Robot Learning (CoRL) (2023)

  101. Qi, Z., et al.: Contrast with reconstruct: contrastive 3D representation learning guided by generative pretraining. In: International Conference on Machine Learning (ICML) (2023)

  102. Qi, Z., Yu, M., Dong, R., Ma, K.: VPP: efficient conditional 3D generation via voxel-point progressive representation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  103. Qi, Z., et al.: GPT4Point: a unified framework for point-language understanding and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26417–26427 (2024)

  104. Qian, G., et al.: PointNeXt: revisiting PointNet++ with improved training and scaling strategies. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  105. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)

  106. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)

  107. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  108. Ren, J., Pan, L., Liu, Z.: Benchmarking and analyzing point cloud classification under corruptions. In: International Conference on Machine Learning (ICML) (2022)

  109. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018 (2018)

  110. Shen, W., Yang, G., Yu, A., Wong, J., Kaelbling, L.P., Isola, P.: Distilled feature fields enable few-shot language-guided manipulation. In: Annual Conference on Robot Learning (CoRL) (2023)

  111. Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: HuggingGPT: solving AI tasks with ChatGPT and its friends in HuggingFace. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  112. Shi, H., Xu, H., Clarke, S., Li, Y., Wu, J.: RoboCook: long-horizon elasto-plastic object manipulation with diverse tools. In: Annual Conference on Robot Learning (CoRL) (2023)

  113. Shutterstock: Turbosquid. https://www.turbosquid.com/

  114. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3D shape recognition. In: International Conference on Computer Vision (ICCV) (2015)

  115. Sun, J., Zhang, Q., Kailkhura, B., Yu, Z., Xiao, C., Mao, Z.M.: ModelNet40-C: a robustness benchmark for 3D point cloud recognition under corruption. In: ICLR 2022 Workshop on Socially Responsible Machine Learning (2022)

  116. Sun, Q., et al.: Generative multimodal models are in-context learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  117. Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: improved training techniques for CLIP at scale. CoRR abs/2303.15389 (2023)

  118. Sun, Q., et al.: Emu: generative pretraining in multimodality. In: International Conference on Learning Representations (ICLR) (2024)

  119. Surís, D., Menon, S., Vondrick, C.: ViperGPT: visual inference via Python execution for reasoning. In: International Conference on Computer Vision (ICCV) (2023)

  120. Taori, R., et al.: Stanford Alpaca: an instruction-following LLaMA model (2023). https://github.com/tatsu-lab/stanford_alpaca

  121. Touvron, H., et al.: LLaMA: open and efficient foundation language models. CoRR abs/2302.13971 (2023)

  122. Uy, M.A., Pham, Q.H., Hua, B.S., Nguyen, T., Yeung, S.K.: Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1588–1597 (2019)

  123. Wan, W., et al.: UniDexGrasp++: improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. In: International Conference on Computer Vision (ICCV) (2023)

  124. Wang, G., et al.: Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. (TMLR) (2024)

  125. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  126. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38(5), 146:1–146:12 (2019)

  127. Wang, Z., Yu, X., Rao, Y., Zhou, J., Lu, J.: Take-a-photo: 3D-to-2D generative pre-training of point cloud models. In: International Conference on Computer Vision (ICCV) (2023)

  128. Wen, H., Liu, Y., Huang, J., Duan, B., Yi, L.: Point primitive transformer for long-term 4D point cloud video understanding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 19–35. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_2

  129. Weng, Y., et al.: CAPTRA: category-level pose tracking for rigid and articulated objects from point clouds. In: International Conference on Computer Vision (ICCV) (2021)

  130. Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual ChatGPT: talking, drawing and editing with visual foundation models. CoRR abs/2303.04671 (2023)

  131. Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.: Next-GPT: any-to-any multimodal LLM. In: International Conference on Machine Learning (ICML) (2024)

  132. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920 (2015)

  133. Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: empowering large language models to understand point clouds. CoRR abs/2308.16911 (2023)

  134. Xu, Y., et al.: UniDexGrasp: universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  135. Xu, Z., Shen, Y., Huang, L.: MULTIINSTRUCT: improving multi-modal zero-shot learning via instruction tuning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers) (2023)

  136. Xue, L., et al.: ULIP: learning unified representation of language, image and point cloud for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  137. Xue, L., et al.: ULIP-2: towards scalable multimodal pre-training for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  138. Yang, R., et al.: GPT4Tools: teaching large language model to use tools via self-instruction. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  139. Yang, Z., et al.: MM-REACT: prompting ChatGPT for multimodal reasoning and action. CoRR abs/2303.11381 (2023)

  140. Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. CoRR abs/2304.14178 (2023)

  141. Ye, S., Chen, D., Han, S., Liao, J.: 3D question answering. IEEE Trans. Vis. Comput. Graph. (2022)

  142. Yi, L., Huang, H., Liu, D., Kalogerakis, E., Su, H., Guibas, L.J.: Deep part induction from articulated object pairs. ACM Trans. Graph. 37(6), 209 (2018)

  143. You, Y., Shen, B., Deng, C., Geng, H., Wang, H., Guibas, L.J.: Make a donut: language-guided hierarchical EMD-space planning for zero-shot deformable object manipulation. CoRR abs/2311.02787 (2023)

  144. Yu, W., et al.: MM-Vet: evaluating large multimodal models for integrated capabilities. In: International Conference on Machine Learning (ICML) (2024)

  145. Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-BERT: pre-training 3D point cloud transformers with masked point modeling. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  146. Zeid, K.A., Schult, J., Hermans, A., Leibe, B.: Point2Vec for self-supervised representation learning on point clouds. In: Köthe, U., Rother, C. (eds.) DAGM GCPR 2023. LNCS, vol. 14264, pp. 131–146. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-54605-1_9

  147. Zhang, J., Dong, R., Ma, K.: CLIP-FO3D: learning free open-world 3D scene representations from 2D dense CLIP. In: International Conference on Computer Vision (ICCV Workshop) (2023)

  148. Zhang, R., et al.: Point-M2AE: multi-scale masked autoencoders for hierarchical point cloud pre-training. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  149. Zhang, R., et al.: PointCLIP: point cloud understanding by CLIP. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  150. Zhang, R., et al.: LLaMA-adapter: efficient fine-tuning of language models with zero-init attention. In: International Conference on Learning Representations (ICLR) (2024)

  151. Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  152. Zhang, S., et al.: GPT4RoI: instruction tuning large language model on region-of-interest. CoRR abs/2307.03601 (2023)

  153. Zhang, Z., Cao, S., Wang, Y.: TAMM: TriAdapter multi-modal learning for 3D shape understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  154. Zhao, L., et al.: ChatSpot: bootstrapping multimodal LLMs via precise referring instruction tuning. In: International Joint Conference on Artificial Intelligence (IJCAI) (2024)

  155. Zhao, X., Wang, H., Komura, T.: Indexing 3D scenes using the interaction bisector surface. ACM Trans. Graph. 33(3), 22:1–22:14 (2014)

  156. Zheng, J., Zheng, Q., Fang, L., Liu, Y., Yi, L.: CAMS: canonicalized manipulation spaces for category-level functional hand-object manipulation synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  157. Zheng, L., et al.: Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  158. Zhou, J., Wang, J., Ma, B., Liu, Y., Huang, T., Wang, X.: Uni3D: exploring unified 3D representation at scale. In: International Conference on Learning Representations (ICLR) (2024)

  159. Zhou, Y., et al.: Analyzing and mitigating object hallucination in large vision-language models. In: International Conference on Learning Representations (ICLR) (2024)

  160. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. In: International Conference on Learning Representations (ICLR) (2024)

  161. Zhu, X., et al.: PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning. In: International Conference on Computer Vision (ICCV) (2023)

  162. Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3D-VisTA: pre-trained transformer for 3D vision and text alignment. In: International Conference on Computer Vision (ICCV) (2023)

Acknowledgments

The work was supported by the Dushi Program from Tsinghua University, the National Key R&D Program of China (2022YFB2804103), and the National Science and Technology Major Project of China (2023ZD0121300).

Author information

Corresponding author

Correspondence to Li Yi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2104 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Qi, Z. et al. (2025). ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_13

  • DOI: https://doi.org/10.1007/978-3-031-72775-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72774-0

  • Online ISBN: 978-3-031-72775-7

  • eBook Packages: Computer Science, Computer Science (R0)
