Abstract
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder obtained by extending ReCon [101] to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and evaluated on 3D MM-Vet, our newly human-curated benchmark. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.
R. Dong—Project lead.
Work done during Z. Qi and R. Dong’s internships at MEGVII & IIISCT.
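To make the pipeline described in the abstract concrete, the following is a minimal sketch of how per-object features from a 3D point-cloud encoder such as ReCon++ could be projected into an LLM's token-embedding space. It is not the authors' implementation; all module names, dimensions, and design choices (PointCloudToLLMProjector, point_feat_dim, llm_hidden_dim, num_visual_tokens, the two-layer MLP projector) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): projecting 3D point-cloud encoder
# features into an LLM's token-embedding space. All names and dimensions
# below are assumptions for illustration only.
import torch
import torch.nn as nn


class PointCloudToLLMProjector(nn.Module):
    """Maps 3D encoder features to a sequence of pseudo-tokens for the LLM."""

    def __init__(self, point_feat_dim: int = 1024, llm_hidden_dim: int = 4096,
                 num_visual_tokens: int = 32):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        # A simple two-layer MLP projector; the actual ShapeLLM design may differ.
        self.proj = nn.Sequential(
            nn.Linear(point_feat_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, point_features: torch.Tensor) -> torch.Tensor:
        # point_features: (batch, num_patches, point_feat_dim) from a 3D encoder.
        tokens = self.proj(point_features)          # (batch, num_patches, llm_hidden_dim)
        return tokens[:, : self.num_visual_tokens]  # keep a fixed number of pseudo-tokens


if __name__ == "__main__":
    # Fake encoder output standing in for ReCon++ features (shapes are assumptions).
    fake_point_features = torch.randn(2, 64, 1024)
    projector = PointCloudToLLMProjector()
    llm_tokens = projector(fake_point_features)
    print(llm_tokens.shape)  # torch.Size([2, 32, 4096])
```

In such a setup, the resulting pseudo-tokens would be concatenated with the text token embeddings of an instruction and fed to the LLM, which is then fine-tuned on instruction-following data.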
References
Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.J.: Learning representations and generative models for 3D point clouds. In: International Conference on Machine Learning (ICML) (2018)
Alayrac, J., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Bai, Y., et al.: Sequential modeling enables scalable learning for large vision models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Betker, J., et al.: Improving image generation with better captions (2023)
Bommasani, R., et al.: On the opportunities and risks of foundation models. CoRR abs/2108.07258 (2021)
Bradski, G., Grossberg, S.: Recognition of 3-D objects from multiple 2-D views by a self-organizing neural architecture. In: Cherkassky, V., Friedman, J.H., Wechsler, H. (eds.) NATO ASI Series, vol. 136, pp. 349–375. Springer, Heidelberg (1994). https://doi.org/10.1007/978-3-642-79119-2_17
Bronstein, A.M., Bronstein, M.M., Guibas, L.J., Ovsjanikov, M.: Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graph. 30(1), 1:1–1:20 (2011)
Brown, T.B., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. CoRR abs/1512.03012 (2015)
Chang, M., et al.: GOAT: GO to any thing. In: Robotics: Science and Systems (RSS) (2024)
Chen, B., et al.: SpatialVLM: endowing vision-language models with spatial reasoning capabilities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 202–221. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_13
Chen, D.Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2Cap: context-aware dense captioning in RGB-D scans. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Chen, G., Wang, M., Yang, Y., Yu, K., Yuan, L., Yue, Y.: PointGPT: auto-regressively generative pre-training from point clouds. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: unleashing multimodal LLM’s referential dialogue magic. CoRR abs/2306.15195 (2023)
Chen, S., Garcia, R., Laptev, I., Schmid, C.: SUGAR: pre-training 3D visual representations for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18049–18060 (2024)
Chen, X., et al.: PaLI-X: on scaling up a multilingual vision and language model. In: International Conference on Learning Representations (ICLR) (2023)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. https://lmsys.org/blog/2023-03-30-vicuna/
Collins, J., et al.: ABO: dataset and benchmarks for real-world 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Das, A., et al.: Visual dialog. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 41(5), 1242–1256 (2019)
Davison, J., Feldman, J., Rush, A.M.: Commonsense knowledge mining from pretrained models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)
Deitke, M., et al.: Objaverse: a universe of annotated 3D objects. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Ding, Y., Zhang, X., Paxton, C., Zhang, S.: Task and motion planning with large language models for object rearrangement. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)
Dong, R., et al.: DreamLLM: synergistic multimodal comprehension and creation. In: International Conference on Learning Representations (ICLR) (2024)
Dong, R., et al.: Autoencoders as cross-modal teachers: can pretrained 2D image transformers help 3D representation learning? In: International Conference on Learning Representations (ICLR) (2023)
Driess, D., et al.: PaLM-E: an embodied multimodal language model. In: International Conference on Machine Learning (ICML) (2023)
Fan, G., Qi, Z., Shi, W., Ma, K.: Point-GCC: universal self-supervised 3D scene pre-training via geometry-color contrast. CoRR abs/2305.19623 (2023)
Fu, H., et al.: 3D-future: 3D furniture shape with texture. Int. J. Comput. Vision 129, 3313–3337 (2021)
Gao, Y., Wang, Z., Zheng, W.S., Xie, C., Zhou, Y.: Sculpting holistic 3D representation in contrastive language-image-3D pre-training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Ge, Y., Ge, Y., Zeng, Z., Wang, X., Shan, Y.: Planting a SEED of vision in large language model. In: International Conference on Learning Representations (ICLR) (2024)
Geng, H., Li, Z., Geng, Y., Chen, J., Dong, H., Wang, H.: PartManip: learning cross-category generalizable part manipulation policy from point cloud observations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Geng, H., Wei, S., Deng, C., Shen, B., Wang, H., Guibas, L.: SAGE: bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions. In: Robotics: Science and Systems (RSS) (2024)
Geng, H., et al.: GAPartNet: cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Geng, Y., An, B., Geng, H., Chen, Y., Yang, Y., Dong, H.: RLAfford: end-to-end affordance learning for robotic manipulation. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)
Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)
Gong, R., et al.: ARNOLD: a benchmark for language-grounded task learning with continuous states in realistic 3D scenes. In: International Conference on Computer Vision (ICCV) (2023)
Grabner, H., Gall, J., Gool, L.V.: What makes a chair a chair? In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Guo, Z., et al.: Point-bind & point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. CoRR abs/2309.00615 (2023)
Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23
Gupta, T., Kembhavi, A.: Visual programming: compositional visual reasoning without training. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Hamdi, A., Giancola, S., Ghanem, B.: MVTN: multi-view transformation network for 3D shape recognition. In: International Conference on Computer Vision (ICCV), pp. 1–11. IEEE (2021)
Hong, Y., et al.: 3D-LLM: injecting the 3D world into large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Hou, J., Xie, S., Graham, B., Dai, A., Nießner, M.: Pri3D: can 3D priors help 2D representation learning? In: International Conference on Computer Vision (ICCV), pp. 5673–5682. IEEE (2021)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
Hu, R., van Kaick, O., Wu, B., Huang, H., Shamir, A., Zhang, H.: Learning how objects function via co-analysis of interactions. ACM Trans. Graph. 35(4), 47:1–47:13 (2016)
Hu, R., Li, W., van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 227:1–227:13 (2017)
Hu, R., Zhu, C., van Kaick, O., Liu, L., Shamir, A., Zhang, H.: Interaction context (ICON): towards a geometric functionality descriptor. ACM Trans. Graph. 34(4), 83:1–83:12 (2015)
Huang, J., et al.: An embodied generalist agent in 3D world. In: International Conference on Machine Learning (ICML) (2024)
Huang, T., et al.: CLIP2Point: transfer CLIP to point cloud classification with image-depth pre-training. In: International Conference on Computer Vision (ICCV) (2023)
Huang, W., Mordatch, I., Pathak, D.: One policy to control them all: shared modular policies for agent-agnostic control. In: International Conference on Machine Learning (ICML) (2020)
Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: VoxPoser: composable 3D value maps for robotic manipulation with language models. In: Annual Conference on Robot Learning (CoRL) (2023)
Huang, W., et al.: Inner monologue: embodied reasoning through planning with language models. In: Annual Conference on Robot Learning (CoRL) (2022)
Ichter, B., et al.: Do as I can, not as I say: grounding language in robotic affordances. In: Annual Conference on Robot Learning (CoRL) (2022)
Ilharco, G., et al.: OpenCLIP, July 2021
Jia, M., et al.: Visual prompt tuning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13693, pp. 709–727. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_41
Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know. Trans. Assoc. Comput. Linguistics 8, 423–438 (2020)
Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)
Kim, V.G., Chaudhuri, S., Guibas, L.J., Funkhouser, T.A.: Shape2Pose: human-centric shape analysis. ACM Trans. Graph. 33(4), 120:1–120:12 (2014)
Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (ICML) (2023)
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)
Li, X., Wang, H., Yi, L., Guibas, L.J., Abbott, A.L., Song, S.: Category-level articulated object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Li, X., et al.: ManipLLM: embodied multimodal large language model for object-centric robotic manipulation (2023)
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305. Association for Computational Linguistics, Singapore (2023)
Liang, Y., et al.: TaskMatrix.AI: completing tasks by connecting foundation models with millions of APIs. Intell. Comput. 3, 0063 (2024)
Lin, K., Agia, C., Migimatsu, T., Pavone, M., Bohg, J.: Text2Motion: from natural language instructions to feasible plans. Auton. Robot. 47(8), 1345–1365 (2023)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Liu, M., et al.: OpenShape: scaling up 3D shape representation towards open-world understanding. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Liu, X., Wang, B., Wang, H., Yi, L.: Few-shot physically-aware articulated mesh generation via hierarchical deformation. In: International Conference on Computer Vision (ICCV) (2023)
Liu, X., Yi, L.: GeneOH diffusion: towards generalizable hand-object interaction denoising via denoising diffusion. In: International Conference on Learning Representations (ICLR) (2024)
Liu, X., Zhang, J., Hu, R., Huang, H., Wang, H., Yi, L.: Self-supervised category-level articulated object pose estimation with part-level SE(3) equivariance. In: International Conference on Learning Representations (ICLR) (2023)
Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? CoRR abs/2307.06281 (2023)
Liu, Y., et al.: SyncDreamer: generating multiview-consistent images from a single-view image. In: International Conference on Learning Representations (ICLR) (2024)
Liu, Y., Chen, J., Zhang, Z., Huang, J., Yi, L.: LeaF: learning frames for 4D point cloud sequence understanding. In: International Conference on Computer Vision (ICCV) (2023)
Lu, C., et al.: Beyond holistic object recognition: enriching image understanding with part states. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3D captioning with pretrained models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Ma, X., et al.: SQA3D: situated question answering in 3D scenes. In: International Conference on Learning Representations (ICLR) (2023)
Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network design and local geometry in point cloud: a simple residual MLP framework. In: International Conference on Learning Representations (ICLR). OpenReview.net (2022)
Mo, K., et al.: PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Mu, Y., et al.: EmbodiedGPT: vision-language pre-training via embodied chain of thought. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023). https://openai.com/research/gpt-4
OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-G: generating images in context with multimodal large language models. In: International Conference on Learning Representations (ICLR) (2024)
Pang, Y., Wang, W., Tay, F.E.H., Liu, W., Tian, Y., Yuan, L.: Masked autoencoders for point cloud self-supervised learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13662, pp. 604–621. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20086-1_35
Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with GPT-4. CoRR abs/2304.03277 (2023)
Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.A.: OpenScene: 3D scene understanding with open vocabularies. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. CoRR abs/2306.14824 (2023)
Petroni, F., et al.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019 (2019)
Pirk, S., et al.: Understanding and exploiting object interaction landscapes. ACM Trans. Graph. 36(3), 31:1–31:14 (2017)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5099–5108 (2017)
Qi, H., Kumar, A., Calandra, R., Ma, Y., Malik, J.: In-hand object rotation via rapid motor adaptation. In: Annual Conference on Robot Learning (CoRL) (2023)
Qi, Z., et al.: Contrast with reconstruct: contrastive 3D representation learning guided by generative pretraining. In: International Conference on Machine Learning (ICML) (2023)
Qi, Z., Yu, M., Dong, R., Ma, K.: VPP: efficient conditional 3D generation via voxel-point progressive representation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Qi, Z., et al.: GPT4Point: a unified framework for point-language understanding and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26417–26427 (2024)
Qian, G., et al.: PointNeXt: revisiting PointNet++ with improved training and scaling strategies. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Ren, J., Pan, L., Liu, Z.: Benchmarking and analyzing point cloud classification under corruptions. In: International Conference on Machine Learning (ICML) (2022)
Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018 (2018)
Shen, W., Yang, G., Yu, A., Wong, J., Kaelbling, L.P., Isola, P.: Distilled feature fields enable few-shot language-guided manipulation. In: Annual Conference on Robot Learning (CoRL) (2023)
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: HuggingGPT: solving AI tasks with ChatGPT and its friends in HuggingFace. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Shi, H., Xu, H., Clarke, S., Li, Y., Wu, J.: RoboCook: long-horizon elasto-plastic object manipulation with diverse tools. In: Annual Conference on Robot Learning (CoRL) (2023)
Shutterstock: TurboSquid. https://www.turbosquid.com/
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3D shape recognition. In: International Conference on Computer Vision (ICCV) (2015)
Sun, J., Zhang, Q., Kailkhura, B., Yu, Z., Xiao, C., Mao, Z.M.: ModelNet40-C: a robustness benchmark for 3D point cloud recognition under corruption. In: ICLR 2022 Workshop on Socially Responsible Machine Learning (2022)
Sun, Q., et al.: Generative multimodal models are in-context learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: improved training techniques for CLIP at scale. CoRR abs/2303.15389 (2023)
Sun, Q., et al.: Emu: generative pretraining in multimodality. In: International Conference on Learning Representations (ICLR) (2024)
Surís, D., Menon, S., Vondrick, C.: ViperGPT: visual inference via Python execution for reasoning. In: International Conference on Computer Vision (ICCV) (2023)
Taori, R., et al.: Stanford Alpaca: an instruction-following LLaMA model (2023). https://github.com/tatsu-lab/stanford_alpaca
Touvron, H., et al.: LLaMA: open and efficient foundation language models. CoRR abs/2302.13971 (2023)
Uy, M.A., Pham, Q.H., Hua, B.S., Nguyen, T., Yeung, S.K.: Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1588–1597 (2019)
Wan, W., et al.: UniDexGrasp++: improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. In: International Conference on Computer Vision (ICCV) (2023)
Wang, G., et al.: Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. (TMLR) (2024)
Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38(5), 146:1–146:12 (2019)
Wang, Z., Yu, X., Rao, Y., Zhou, J., Lu, J.: Take-a-photo: 3D-to-2D generative pre-training of point cloud models. In: International Conference on Computer Vision (ICCV) (2023)
Wen, H., Liu, Y., Huang, J., Duan, B., Yi, L.: Point primitive transformer for long-term 4D point cloud video understanding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 19–35. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_2
Weng, Y., et al.: CAPTRA: category-level pose tracking for rigid and articulated objects from point clouds. In: International Conference on Computer Vision (ICCV) (2021)
Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual ChatGPT: talking, drawing and editing with visual foundation models. CoRR abs/2303.04671 (2023)
Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.: NExT-GPT: any-to-any multimodal LLM. In: International Conference on Machine Learning (ICML) (2024)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920 (2015)
Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: empowering large language models to understand point clouds. CoRR abs/2308.16911 (2023)
Xu, Y., et al.: UniDexGrasp: universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Xu, Z., Shen, Y., Huang, L.: MULTIINSTRUCT: improving multi-modal zero-shot learning via instruction tuning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers) (2023)
Xue, L., et al.: ULIP: learning unified representation of language, image and point cloud for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Xue, L., et al.: ULIP-2: towards scalable multimodal pre-training for 3D understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Yang, R., et al.: GPT4Tools: teaching large language model to use tools via self-instruction. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Yang, Z., et al.: MM-REACT: prompting ChatGPT for multimodal reasoning and action. CoRR abs/2303.11381 (2023)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. CoRR abs/2304.14178 (2023)
Ye, S., Chen, D., Han, S., Liao, J.: 3D question answering. IEEE Trans. Vis. Comput. Graph. (2022)
Yi, L., Huang, H., Liu, D., Kalogerakis, E., Su, H., Guibas, L.J.: Deep part induction from articulated object pairs. ACM Trans. Graph. 37(6), 209 (2018)
You, Y., Shen, B., Deng, C., Geng, H., Wang, H., Guibas, L.J.: Make a donut: language-guided hierarchical EMD-space planning for zero-shot deformable object manipulation. CoRR abs/2311.02787 (2023)
Yu, W., et al.: MM-Vet: evaluating large multimodal models for integrated capabilities. In: International Conference on Machine Learning (ICML) (2024)
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-BERT: pre-training 3D point cloud transformers with masked point modeling. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Zeid, K.A., Schult, J., Hermans, A., Leibe, B.: Point2Vec for self-supervised representation learning on point clouds. In: Köthe, U., Rother, C. (eds.) DAGM GCPR 2023. LNCS, vol. 14264, pp. 131–146. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-54605-1_9
Zhang, J., Dong, R., Ma, K.: CLIP-FO3D: learning free open-world 3D scene representations from 2D dense CLIP. In: International Conference on Computer Vision (ICCV Workshop) (2023)
Zhang, R., et al.: Point-M2AE: multi-scale masked autoencoders for hierarchical point cloud pre-training. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Zhang, R., et al.: PointCLIP: point cloud understanding by CLIP. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Zhang, R., et al.: LLaMA-adapter: efficient fine-tuning of language models with zero-init attention. In: International Conference on Learning Representations (ICLR) (2024)
Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Zhang, S., et al.: GPT4RoI: instruction tuning large language model on region-of-interest. CoRR abs/2307.03601 (2023)
Zhang, Z., Cao, S., Wang, Y.: TAMM: TriAdapter multi-modal learning for 3D shape understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Zhao, L., et al.: ChatSpot: bootstrapping multimodal LLMs via precise referring instruction tuning. In: International Joint Conference on Artificial Intelligence (IJCAI) (2024)
Zhao, X., Wang, H., Komura, T.: Indexing 3D scenes using the interaction bisector surface. ACM Trans. Graph. 33(3), 22:1–22:14 (2014)
Zheng, J., Zheng, Q., Fang, L., Liu, Y., Yi, L.: CAMS: canonicalized manipulation spaces for category-level functional hand-object manipulation synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Zheng, L., et al.: Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
Zhou, J., Wang, J., Ma, B., Liu, Y., Huang, T., Wang, X.: Uni3D: exploring unified 3D representation at scale. In: International Conference on Learning Representations (ICLR) (2024)
Zhou, Y., et al.: Analyzing and mitigating object hallucination in large vision-language models. In: International Conference on Learning Representations (ICLR) (2024)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. In: International Conference on Learning Representations (ICLR) (2024)
Zhu, X., et al.: PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning. In: International Conference on Computer Vision (ICCV) (2023)
Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3D-VisTA: pre-trained transformer for 3D vision and text alignment. In: International Conference on Computer Vision (ICCV) (2023)
Acknowledgments
The work was supported by the Dushi Program from Tsinghua University, the National Key R&D Program of China (2022YFB2804103), and the National Science and Technology Major Project of China (2023ZD0121300).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Qi, Z. et al. (2025). ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7