Abstract
Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance research on 3D instruction segmentation, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding results on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.
Acknowledgements
We thank the anonymous reviewers for their constructive suggestions. Following their advice, we have incorporated diverse location and view descriptions into our Instruct3D. This work was partially supported by the National Research Foundation Singapore Competitive Research Program (CRP29-2022-0003).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
He, S., Ding, H., Jiang, X., Wen, B.: SegPoint: segment any point cloud via large language model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15080. Springer, Cham (2025). https://doi.org/10.1007/978-3-031-72670-5_20
Print ISBN: 978-3-031-72669-9
Online ISBN: 978-3-031-72670-5