Abstract
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder obtained by extending ReCon [101] to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and evaluated on 3D MM-Vet, our newly human-curated benchmark. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.
R. Dong—Project lead.
Work done during Z. Qi and R. Dong’s internships at MEGVII & IIISCT.
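To make the pipeline described in the abstract concrete, the following is a minimal sketch of how per-object features from a 3D point-cloud encoder such as ReCon++ could be projected into an LLM's token-embedding space. It is not the authors' implementation; all module names, dimensions, and design choices (PointCloudToLLMProjector, point_feat_dim, llm_hidden_dim, num_visual_tokens, the two-layer MLP projector) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): projecting 3D point-cloud encoder
# features into an LLM's token-embedding space. All names and dimensions
# below are assumptions for illustration only.
import torch
import torch.nn as nn


class PointCloudToLLMProjector(nn.Module):
    """Maps 3D encoder features to a sequence of pseudo-tokens for the LLM."""

    def __init__(self, point_feat_dim: int = 1024, llm_hidden_dim: int = 4096,
                 num_visual_tokens: int = 32):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        # A simple two-layer MLP projector; the actual ShapeLLM design may differ.
        self.proj = nn.Sequential(
            nn.Linear(point_feat_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, point_features: torch.Tensor) -> torch.Tensor:
        # point_features: (batch, num_patches, point_feat_dim) from a 3D encoder.
        tokens = self.proj(point_features)          # (batch, num_patches, llm_hidden_dim)
        return tokens[:, : self.num_visual_tokens]  # keep a fixed number of pseudo-tokens


if __name__ == "__main__":
    # Fake encoder output standing in for ReCon++ features (shapes are assumptions).
    fake_point_features = torch.randn(2, 64, 1024)
    projector = PointCloudToLLMProjector()
    llm_tokens = projector(fake_point_features)
    print(llm_tokens.shape)  # torch.Size([2, 32, 4096])
```

In such a setup, the resulting pseudo-tokens would be concatenated with the text token embeddings of an instruction and fed to the LLM, which is then fine-tuned on instruction-following data.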
References
Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.J.: Learning representations and generative models for 3D point clouds. In: International Conference on Machine Learning (ICML) (2018)
Alayrac, J., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Bai, Y., et al.: Sequential modeling enables scalable learning for large vision models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Betker, J., et al.: Improving image generation with better captions (2023)
Bommasani, R., et al.: On the opportunities and risks of foundation models. CoRR abs/2108.07258 (2021)
Bradski, G., Grossberg, S.: Recognition of 3-D objects from multiple 2-D views by a self-organizing neural architecture. In: Cherkassky, V., Friedman, J.H., Wechsler, H. (eds.) NATO ASI Series, vol. 136, pp. 349–375. Springer, Heidelberg (1994). https://doi.org/10.1007/978-3-642-79119-2_17
Bronstein, A.M., Bronstein, M.M., Guibas, L.J., Ovsjanikov, M.: Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graph. 30(1), 1:1–1:20 (2011)
Brown, T.B., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. CoRR abs/1512.03012 (2015)
Chang, M., et al.: GOAT: GO to any thing. In: Robotics: Science and Systems (RSS) (2024)
Chen, B., et al.: SpatialVLM: endowing vision-language models with spatial reasoning capabilities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 202–221. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_13
Chen, D.Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2Cap: context-aware dense captioning in RGB-D scans. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Chen, G., Wang, M., Yang, Y., Yu, K., Yuan, L., Yue, Y.: PointGPT: auto-regressively generative pre-training from point clouds. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: unleashing multimodal LLM’s referential dialogue magic. CoRR abs/2306.15195 (2023)
Chen, S., Garcia, R., Laptev, I., Schmid, C.: SUGAR: pre-training 3D visual representations for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18049–18060 (2024)
Chen, X., et al.: PaLI-X: on scaling up a multilingual vision and language model. In: International Conference on Learning Representations (ICLR) (2023)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. https://lmsys.org/blog/2023-03-30-vicuna/
Collins, J., et al.: ABO: dataset and benchmarks for real-world 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Das, A., et al.: Visual dialog. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 41(5), 1242–1256 (2019)
Davison, J., Feldman, J., Rush, A.M.: Commonsense knowledge mining from pretrained models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)
Deitke, M., et al.: Objaverse: a universe of annotated 3D objects. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Ding, Y., Zhang, X., Paxton, C., Zhang, S.: Task and motion planning with large language models for object rearrangement. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)
Dong, R., et al.: DreamLLM: synergistic multimodal comprehension and creation. In: International Conference on Learning Representations (ICLR) (2024)
Dong, R., et al.: Autoencoders as cross-modal teachers: can pretrained 2D image transformers help 3D representation learning? In: International Conference on Learning Representations (ICLR) (2023)
Driess, D., et al.: PaLM-E: an embodied multimodal language model. In: International Conference on Machine Learning (ICML) (2023)
Fan, G., Qi, Z., Shi, W., Ma, K.: Point-GCC: universal self-supervised 3D scene pre-training via geometry-color contrast. CoRR abs/2305.19623 (2023)
Fu, H., et al.: 3D-future: 3D furniture shape with texture. Int. J. Comput. Vision 129, 3313–3337 (2021)
Gao, Y., Wang, Z., Zheng, W.S., Xie, C., Zhou, Y.: Sculpting holistic 3D representation in contrastive language-image-3D pre-training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Ge, Y., Ge, Y., Zeng, Z., Wang, X., Shan, Y.: Planting a SEED of vision in large language model. In: International Conference on Learning Representations (ICLR) (2024)
Geng, H., Li, Z., Geng, Y., Chen, J., Dong, H., Wang, H.: PartManip: learning cross-category generalizable part manipulation policy from point cloud observations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Geng, H., Wei, S., Deng, C., Shen, B., Wang, H., Guibas, L.: SAGE: bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions. In: Robotics: Science and Systems (RSS) (2024)
Geng, H., et al.: GAPartNet: cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Geng, Y., An, B., Geng, H., Chen, Y., Yang, Y., Dong, H.: RLAfford: end-to-end affordance learning for robotic manipulation. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)
Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)
Gong, R., et al.: ARNOLD: a benchmark for language-grounded task learning with continuous states in realistic 3D scenes. In: International Conference on Computer Vision (ICCV) (2023)
Grabner, H., Gall, J., Gool, L.V.: What makes a chair a chair? In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Guo, Z., et al.: Point-bind & point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. CoRR abs/2309.00615 (2023)
Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23
Gupta, T., Kembhavi, A.: Visual programming: compositional visual reasoning without training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Hamdi, A., Giancola, S., Ghanem, B.: MVTN: multi-view transformation network for 3D shape recognition. In: International Conference on Computer Vision (ICCV), pp. 1–11. IEEE (2021)
Hong, Y., et al.: 3D-LLM: injecting the 3D world into large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Hou, J., Xie, S., Graham, B., Dai, A., Nießner, M.: Pri3D: can 3D priors help 2D representation learning? In: International Conference on Computer Vision (ICCV), pp. 5673–5682. IEEE (2021)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
Hu, R., van Kaick, O., Wu, B., Huang, H., Shamir, A., Zhang, H.: Learning how objects function via co-analysis of interactions. ACM Trans. Graph. 35(4), 47:1–47:13 (2016)
Hu, R., Li, W., van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 227:1–227:13 (2017)
Hu, R., Zhu, C., van Kaick, O., Liu, L., Shamir, A., Zhang, H.: Interaction context (ICON): towards a geometric functionality descriptor. ACM Trans. Graph. 34(4), 83:1–83:12 (2015)
Huang, J., et al.: An embodied generalist agent in 3D world. In: International Conference on Machine Learning (ICML) (2024)
Huang, T., et al.: CLIP2Point: transfer CLIP to point cloud classification with image-depth pre-training. In: International Conference on Computer Vision (ICCV) (2023)
Huang, W., Mordatch, I., Pathak, D.: One policy to control them all: shared modular policies for agent-agnostic control. In: International Conference on Machine Learning (ICML) (2020)
Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: VoxPoser: composable 3D value maps for robotic manipulation with language models. In: Annual Conference on Robot Learning (CoRL) (2023)
Huang, W., et al.: Inner monologue: embodied reasoning through planning with language models. In: Annual Conference on Robot Learning (CoRL) (2022)
Ichter, B., et al.: Do as I can, not as I say: grounding language in robotic affordances. In: Annual Conference on Robot Learning (CoRL) (2022)
Ilharco, G., et al.: OpenCLIP, July 2021
Jia, M., et al.: Visual prompt tuning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13693, pp. 709–727. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_41
Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know. Trans. Assoc. Comput. Linguistics 8, 423–438 (2020)
Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)
Kim, V.G., Chaudhuri, S., Guibas, L.J., Funkhouser, T.A.: Shape2Pose: human-centric shape analysis. ACM Trans. Graph. 33(4), 120:1–120:12 (2014)
Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (ICML) (2023)
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)
Li, X., Wang, H., Yi, L., Guibas, L.J., Abbott, A.L., Song, S.: Category-level articulated object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Li, X., et al.: ManipLLM: embodied multimodal large language model for object-centric robotic manipulation (2023)
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305. Association for Computational Linguistics, Singapore (2023)
Liang, Y., et al.: TaskMatrix.AI: completing tasks by connecting foundation models with millions of APIs. Intell. Comput. 3, 0063 (2024)
Lin, K., Agia, C., Migimatsu, T., Pavone, M., Bohg, J.: Text2Motion: from natural language instructions to feasible plans. Auton. Robot. 47(8), 1345–1365 (2023)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Liu, M., et al.: OpenShape: scaling up 3D shape representation towards open-world understanding. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Liu, X., Wang, B., Wang, H., Yi, L.: Few-shot physically-aware articulated mesh generation via hierarchical deformation. In: International Conference on Computer Vision (ICCV) (2023)
Liu, X., Yi, L.: GeneOH diffusion: towards generalizable hand-object interaction denoising via denoising diffusion. In: International Conference on Learning Representations (ICLR) (2024)
Liu, X., Zhang, J., Hu, R., Huang, H., Wang, H., Yi, L.: Self-supervised category-level articulated object pose estimation with part-level SE(3) equivariance. In: International Conference on Learning Representations (ICLR) (2023)
Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? CoRR abs/2307.06281 (2023)
Liu, Y., et al.: SyncDreamer: generating multiview-consistent images from a single-view image. In: International Conference on Learning Representations (ICLR) (2024)
Liu, Y., Chen, J., Zhang, Z., Huang, J., Yi, L.: LeaF: learning frames for 4D point cloud sequence understanding. In: International Conference on Computer Vision (ICCV) (2023)
Lu, C., et al.: Beyond holistic object recognition: enriching image understanding with part states. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3D captioning with pretrained models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Ma, X., et al.: SQA3D: situated question answering in 3D scenes. In: International Conference on Learning Representations (ICLR) (2023)
Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network design and local geometry in point cloud: a simple residual MLP framework. In: International Conference on Learning Representations (ICLR). OpenReview.net (2022)
Mo, K., et al.: PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Mu, Y., et al.: EmbodiedGPT: vision-language pre-training via embodied chain of thought. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023). https://openai.com/research/gpt-4
OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-G: generating images in context with multimodal large language models. In: International Conference on Learning Representations (ICLR) (2024)
Pang, Y., Wang, W., Tay, F.E.H., Liu, W., Tian, Y., Yuan, L.: Masked autoencoders for point cloud self-supervised learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13662, pp. 604–621. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20086-1_35
Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with GPT-4. CoRR abs/2304.03277 (2023)
Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.A.: OpenScene: 3D scene understanding with open vocabularies. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. CoRR abs/2306.14824 (2023)
Petroni, F., et al.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)
Pirk, S., et al.: Understanding and exploiting object interaction landscapes. ACM Trans. Graph. 36(3), 31:1–31:14 (2017)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5099–5108 (2017)
Qi, H., Kumar, A., Calandra, R., Ma, Y., Malik, J.: In-hand object rotation via rapid motor adaptation. In: Annual Conference on Robot Learning (CoRL) (2023)
Qi, Z., et al.: Contrast with reconstruct: contrastive 3D representation learning guided by generative pretraining. In: International Conference on Machine Learning (ICML) (2023)
Qi, Z., Yu, M., Dong, R., Ma, K.: VPP: efficient conditional 3D generation via voxel-point progressive representation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Qi, Z., et al.: GPT4Point: a unified framework for point-language understanding and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26417–26427 (2024)
Qian, G., et al.: PointNeXt: revisiting PointNet++ with improved training and scaling strategies. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Ren, J., Pan, L., Liu, Z.: Benchmarking and analyzing point cloud classification under corruptions. In: International Conference on Machine Learning (ICML) (2022)
Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018 (2018)
Shen, W., Yang, G., Yu, A., Wong, J., Kaelbling, L.P., Isola, P.: Distilled feature fields enable few-shot language-guided manipulation. In: Annual Conference on Robot Learning (CoRL) (2023)
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: HuggingGPT: solving AI tasks with ChatGPT and its friends in HuggingFace. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Shi, H., Xu, H., Clarke, S., Li, Y., Wu, J.: RoboCook: long-horizon elasto-plastic object manipulation with diverse tools. In: Annual Conference on Robot Learning (CoRL) (2023)
Shutterstock: TurboSquid. https://www.turbosquid.com/
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3D shape recognition. In: International Conference on Computer Vision (ICCV) (2015)
Sun, J., Zhang, Q., Kailkhura, B., Yu, Z., Xiao, C., Mao, Z.M.: ModelNet40-C: a robustness benchmark for 3D point cloud recognition under corruption. In: ICLR 2022 Workshop on Socially Responsible Machine Learning (2022)
Sun, Q., et al.: Generative multimodal models are in-context learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: improved training techniques for CLIP at scale. CoRR abs/2303.15389 (2023)
Sun, Q., et al.: Emu: generative pretraining in multimodality. In: International Conference on Learning Representations (ICLR) (2024)
Surís, D., Menon, S., Vondrick, C.: ViperGPT: visual inference via Python execution for reasoning. In: International Conference on Computer Vision (ICCV) (2023)
Taori, R., et al.: Stanford Alpaca: an instruction-following LLaMA model (2023). https://github.com/tatsu-lab/stanford_alpaca
Touvron, H., et al.: LLaMA: open and efficient foundation language models. CoRR abs/2302.13971 (2023)
Uy, M.A., Pham, Q.H., Hua, B.S., Nguyen, T., Yeung, S.K.: Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1588–1597 (2019)
Wan, W., et al.: UniDexGrasp++: improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. In: International Conference on Computer Vision (ICCV) (2023)
Wang, G., et al.: Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. (TMLR) (2024)
Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38(5), 146:1–146:12 (2019)
Wang, Z., Yu, X., Rao, Y., Zhou, J., Lu, J.: Take-a-photo: 3D-to-2D generative pre-training of point cloud models. In: International Conference on Computer Vision (ICCV) (2023)
Wen, H., Liu, Y., Huang, J., Duan, B., Yi, L.: Point primitive transformer for long-term 4D point cloud video understanding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 19–35. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_2
Weng, Y., et al.: CAPTRA: category-level pose tracking for rigid and articulated objects from point clouds. In: International Conference on Computer Vision (ICCV) (2021)
Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual ChatGPT: talking, drawing and editing with visual foundation models. CoRR abs/2303.04671 (2023)
Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.: NExT-GPT: any-to-any multimodal LLM. In: International Conference on Machine Learning (ICML) (2024)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920 (2015)
Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: empowering large language models to understand point clouds. CoRR abs/2308.16911 (2023)
Xu, Y., et al.: UniDexGrasp: universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Xu, Z., Shen, Y., Huang, L.: MULTIINSTRUCT: improving multi-modal zero-shot learning via instruction tuning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers) (2023)
Xue, L., et al.: ULIP: learning unified representation of language, image and point cloud for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Xue, L., et al.: ULIP-2: towards scalable multimodal pre-training for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Yang, R., et al.: GPT4Tools: teaching large language model to use tools via self-instruction. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Yang, Z., et al.: MM-REACT: prompting ChatGPT for multimodal reasoning and action. CoRR abs/2303.11381 (2023)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. CoRR abs/2304.14178 (2023)
Ye, S., Chen, D., Han, S., Liao, J.: 3D question answering. IEEE Trans. Vis. Comput. Graph. (2022)
Yi, L., Huang, H., Liu, D., Kalogerakis, E., Su, H., Guibas, L.J.: Deep part induction from articulated object pairs. ACM Trans. Graph. 37(6), 209 (2018)
You, Y., Shen, B., Deng, C., Geng, H., Wang, H., Guibas, L.J.: Make a donut: language-guided hierarchical EMD-space planning for zero-shot deformable object manipulation. CoRR abs/2311.02787 (2023)
Yu, W., et al.: MM-Vet: evaluating large multimodal models for integrated capabilities. In: International Conference on Machine Learning (ICML) (2024)
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-BERT: pre-training 3D point cloud transformers with masked point modeling. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Zeid, K.A., Schult, J., Hermans, A., Leibe, B.: Point2Vec for self-supervised representation learning on point clouds. In: Köthe, U., Rother, C. (eds.) DAGM GCPR 2023. LNCS, vol. 14264, pp. 131–146. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-54605-1_9
Zhang, J., Dong, R., Ma, K.: CLIP-FO3D: learning free open-world 3D scene representations from 2D dense CLIP. In: International Conference on Computer Vision (ICCV Workshop) (2023)
Zhang, R., et al.: Point-M2AE: multi-scale masked autoencoders for hierarchical point cloud pre-training. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Zhang, R., et al.: PointCLIP: point cloud understanding by CLIP. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Zhang, R., et al.: LLaMA-adapter: efficient fine-tuning of language models with zero-init attention. In: International Conference on Learning Representations (ICLR) (2024)
Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Zhang, S., et al.: GPT4RoI: instruction tuning large language model on region-of-interest. CoRR abs/2307.03601 (2023)
Zhang, Z., Cao, S., Wang, Y.: TAMM: TriAdapter multi-modal learning for 3D shape understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Zhao, L., et al.: ChatSpot: bootstrapping multimodal LLMs via precise referring instruction tuning. In: International Joint Conference on Artificial Intelligence (IJCAI) (2024)
Zhao, X., Wang, H., Komura, T.: Indexing 3D scenes using the interaction bisector surface. ACM Trans. Graph. 33(3), 22:1–22:14 (2014)
Zheng, J., Zheng, Q., Fang, L., Liu, Y., Yi, L.: CAMS: canonicalized manipulation spaces for category-level functional hand-object manipulation synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Zheng, L., et al.: Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
Zhou, J., Wang, J., Ma, B., Liu, Y., Huang, T., Wang, X.: Uni3D: exploring unified 3D representation at scale. In: International Conference on Learning Representations (ICLR) (2024)
Zhou, Y., et al.: Analyzing and mitigating object hallucination in large vision-language models. In: International Conference on Learning Representations (ICLR) (2024)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. In: International Conference on Learning Representations (ICLR) (2024)
Zhu, X., et al.: PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning. In: International Conference on Computer Vision (ICCV) (2023)
Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3D-VisTA: pre-trained transformer for 3D vision and text alignment. In: International Conference on Computer Vision (ICCV) (2023)
Acknowledgments
The work was supported by the Dushi Program from Tsinghua University, the National Key R&D Program of China (2022YFB2804103), and the National Science and Technology Major Project of China (2023ZD0121300).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Qi, Z. et al. (2025). ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7