Abstract
Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance research on 3D instruction segmentation, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding results on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.
Acknowledgements
We thank the anonymous reviewers for their constructive suggestions. Following their advice, we have incorporated diverse location and view descriptions into our Instruct3D. This work was partially supported by the National Research Foundation Singapore Competitive Research Program (CRP29-2022-0003).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
He, S., Ding, H., Jiang, X., Wen, B.: SegPoint: segment any point cloud via large language model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15080. Springer, Cham (2025). https://doi.org/10.1007/978-3-031-72670-5_20
Print ISBN: 978-3-031-72669-9
Online ISBN: 978-3-031-72670-5