SegPoint: Segment Any Point Cloud via Large Language Model

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding outcomes on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.
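The abstract reports segmentation performance on benchmarks such as ScanNet. As general background (not taken from this paper), point-wise semantic segmentation is commonly scored with mean intersection-over-union (mIoU): per-class IoU averaged over the classes present. A minimal sketch, assuming integer class labels per point; the function name and toy data are illustrative, not from SegPoint:

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union for point-wise class labels."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: 8 points, 3 classes
pred = np.array([0, 0, 1, 1, 2, 2, 2, 0])
gt = np.array([0, 1, 1, 1, 2, 2, 0, 0])
print(round(miou(pred, gt, 3), 3))  # → 0.611
```

Benchmark implementations differ in details (e.g. ignore labels, confusion-matrix accumulation over scenes), so treat this as a conceptual sketch of the metric only.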


Acknowledgements

We thank the anonymous reviewers for their constructive suggestions. Following their advice, we have incorporated diverse location and view descriptions into our Instruct3D benchmark. This work was partially supported by the National Research Foundation Singapore Competitive Research Program (CRP29-2022-0003).

Author information

Corresponding author

Correspondence to Bihan Wen.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

He, S., Ding, H., Jiang, X., Wen, B. (2025). SegPoint: Segment Any Point Cloud via Large Language Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72670-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72669-9

  • Online ISBN: 978-3-031-72670-5

  • eBook Packages: Computer Science, Computer Science (R0)
