
Uni3DL: A Unified Model for 3D Vision-Language Understanding

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present Uni3DL, a unified model for 3D Vision-Language understanding. Distinct from existing unified 3D vision-language models that mostly rely on projected multi-view images and support limited tasks, Uni3DL operates directly on point clouds and significantly broadens the spectrum of tasks in the 3D domain, encompassing both vision and vision-language tasks. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively produce task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D vision-language understanding. Project page: https://uni3dl.github.io/.
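To make the described decomposition concrete, below is a minimal, illustrative PyTorch sketch of the two components named in the abstract: a set of learnable queries cross-attending to point-cloud features to produce task-agnostic semantic and mask outputs, and a task router that dispatches the shared query features to task-specific heads. All module names, dimensions, and the particular set of heads are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a minimal PyTorch rendition of the high-level design
# described in the abstract (learnable queries cross-attending to point features,
# followed by a task router that selects task-specific heads). Module names,
# sizes, and the set of heads are assumptions, not the released Uni3DL code.
import torch
import torch.nn as nn


class QueryTransformer(nn.Module):
    """Learnable queries attend to per-point features and yield task-agnostic outputs."""

    def __init__(self, num_queries=100, d_model=256, num_layers=4, num_classes=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.class_head = nn.Linear(d_model, num_classes)  # semantic logits per query
        self.mask_proj = nn.Linear(d_model, d_model)        # query embedding used for mask dot-product

    def forward(self, point_feats):                          # point_feats: (B, N_points, d_model)
        q = self.queries.weight.unsqueeze(0).expand(point_feats.size(0), -1, -1)
        q = self.decoder(q, point_feats)                     # (B, N_queries, d_model)
        class_logits = self.class_head(q)                    # (B, N_queries, num_classes)
        mask_logits = torch.einsum("bqd,bnd->bqn", self.mask_proj(q), point_feats)
        return q, class_logits, mask_logits


class TaskRouter(nn.Module):
    """Dispatches the shared query features to task-specific heads (assumed task set)."""

    def __init__(self, d_model=256, vocab_size=30522):
        super().__init__()
        self.heads = nn.ModuleDict({
            "grounding": nn.Linear(d_model, 1),          # per-query matching score against a text query
            "retrieval": nn.Linear(d_model, d_model),    # embedding for text-3D cross-modal retrieval
            "caption":   nn.Linear(d_model, vocab_size), # token logits for 3D captioning
        })

    def forward(self, task, query_feats):
        return self.heads[task](query_feats)


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 256)                     # stand-in for backbone point features
    qt, router = QueryTransformer(), TaskRouter()
    q, cls_logits, mask_logits = qt(feats)
    print(cls_logits.shape, mask_logits.shape)            # (2, 100, 20) (2, 100, 1024)
    print(router("grounding", q).shape)                   # (2, 100, 1)
```

The point of the sketch is the parameter sharing: the query transformer is reused for every task, and only the small routed heads are task-specific.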

X. Li and J. Ding—Equal contribution.

Z. Chen: This work was done when Zhaoyang Chen was an intern at KAUST.


Notes

  1. To ensure a fair comparison with PointLLM, we filter out the 200 objects used for benchmark evaluation from our training set and report performance on the same 200 objects.
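The filtering in this footnote amounts to a set-difference over object identifiers: remove the benchmark objects from the training set, then evaluate on exactly that held-out set. A minimal sketch follows; the record layout and ID scheme are purely hypothetical, not the actual data format.

```python
# Hypothetical sketch of the footnote's protocol: exclude the 200 objects used
# for the PointLLM benchmark from training and evaluate on exactly those objects.
# Record layout and IDs are illustrative assumptions, not the real datasets.
full_corpus = [{"id": f"obj_{i:04d}"} for i in range(1000)]   # stand-in object records
benchmark_ids = {f"obj_{i:04d}" for i in range(200)}           # the 200 held-out evaluation objects

train_set = [o for o in full_corpus if o["id"] not in benchmark_ids]
eval_set = [o for o in full_corpus if o["id"] in benchmark_ids]

assert len(eval_set) == 200
assert benchmark_ids.isdisjoint(o["id"] for o in train_set)   # no evaluation object seen in training
```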

References

  1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022)

  2. Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543 (2016)

  3. Baez, J., Huerta, J.: The algebra of grand unified theories. Bull. Am. Math. Soc. 47(3), 483–552 (2010)

  4. Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)

  5. Cai, D., Zhao, L., Zhang, J., Sheng, L., Xu, D.: 3DJCG: a unified framework for joint dense captioning and visual grounding on 3D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16464–16473 (2022)

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  7. Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: European Conference on Computer Vision, pp. 202–221. Springer (2020). https://doi.org/10.1007/978-3-030-58565-5_13

  8. Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  9. Chen, K., Choy, C.B., Savva, M., Chang, A.X., Funkhouser, T., Savarese, S.: Text2Shape: generating shapes from natural language by learning joint embeddings. In: Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pp. 100–116. Springer (2019). https://doi.org/10.1007/978-3-030-20893-6_7

  10. Chen, S., Fang, J., Zhang, Q., Liu, W., Wang, X.: Hierarchical aggregation for 3D instance segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15447–15456 (2021). https://doi.org/10.1109/ICCV48922.2021.01518

  11. Chen, T., Saxena, S., Li, L., Lin, T.Y., Fleet, D.J., Hinton, G.E.: A unified sequence interface for vision tasks. Adv. Neural Inf. Process. Syst. 35, 31333–31346 (2022)

  12. Chen, Z., Hu, R., Chen, X., Nießner, M., Chang, A.X.: UniT3D: a unified transformer for 3D dense captioning and visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18109–18119 (2023)

  13. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1290–1299 (2022)

  14. Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 34, 17864–17875 (2021)

  15. Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org. Accessed 14 April 2023 (2023)

  16. Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084 (2019)

  17. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)

  18. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning (2023)

  19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  20. Guo, Z., et al.: Point-Bind & Point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)

  21. Han, L., Zheng, T., Xu, L., Fang, L.: OccuSeg: occupancy-aware 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2940–2949 (2020)

  22. Han, Z., Shang, M., Wang, X., Liu, Y.S., Zwicker, M.: Y2Seq2Seq: cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 126–133 (2019)

  23. Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3D-LLM: injecting the 3D world into large language models. arXiv preprint arXiv:2307.12981 (2023)

  24. Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4430 (2019)

  25. Huang, P.H., Lee, H.H., Chen, H.T., Liu, T.L.: Text-guided graph neural networks for referring 3D instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1610–1618 (2021)

  26. Huang, T., et al.: CLIP2Point: transfer CLIP to point cloud classification with image-depth pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22157–22167 (2023)

  27. Ilharco, G., et al.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773

  28. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)

  29. Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: PointGroup: dual-set point grouping for 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4867–4876 (2020)

  30. Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., Qi, X., Jia, J.: Stratified transformer for 3D point cloud segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8500–8509 (2022)

  31. Lai, X., Yuan, Y., Chu, R., Chen, Y., Hu, H., Jia, J.: Mask-attention-free transformer for 3D instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3693–3703 (2023)

  32. Langacker, P.: Grand unified theories and proton decay. Phys. Rep. 72(4), 185–385 (1981)

  33. Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J.: Multimodal foundation models: from specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)

  34. Li, H., et al.: Uni-Perceiver v2: a generalist model for large-scale vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2691–2700 (2023)

  35. Liang, Z., Li, Z., Xu, S., Tan, M., Jia, K.: Instance segmentation in 3D scenes using semantic superpoint tree networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2783–2792 (2021)

  36. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)

  37. Liu, S.H., Yu, S.Y., Wu, S.C., Chen, H.T., Liu, T.L.: Learning Gaussian instance segmentation in point clouds. arXiv preprint arXiv:2007.09860 (2020)

  38. Locatello, F., et al.: Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525–11538 (2020)

  39. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  40. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: a unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)

  41. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3D captioning with pretrained models. arXiv preprint arXiv:2306.07279 (2023)

  42. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)

  43. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  44. Park, C., Jeong, Y., Cho, M., Park, J.: Fast point transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16949–16958 (2022)

  45. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  46. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  47. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30 (2017)

  48. Qian, G., et al.: PointNeXt: revisiting PointNet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 35, 23192–23204 (2022)

  49. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  50. Radford, A., et al.: Improving language understanding by generative pre-training (2018)

  51. Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3D: mask transformer for 3D semantic instance segmentation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 8216–8223. IEEE (2023)

  52. Sun, J., Qing, C., Tan, J., Xu, X.: Superpoint transformer for 3D scene instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2393–2401 (2023)

  53. Tang, C., Yang, X., Wu, B., Han, Z., Chang, Y.: Parts2Words: learning joint embedding of point clouds and texts by bidirectional matching between parts and words. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6884–6893 (2023)

  54. Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D.: SoftGroup for 3D instance segmentation on point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2708–2717 (2022)

  55. Wang, H., et al.: CAGroup3D: class-aware grouping for 3D object detection on point clouds. Adv. Neural Inf. Process. Syst. 35, 29975–29988 (2022)

  56. Wang, J., et al.: GIT: a generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)

  57. Wang, P., et al.: OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning, pp. 23318–23340. PMLR (2022)

  58. Wang, W., et al.: VisionLLM: large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175 (2023)

  59. Wu, X., Lao, Y., Jiang, L., Liu, X., Zhao, H.: Point transformer v2: grouped vector attention and partition-based pooling. Adv. Neural Inf. Process. Syst. 35, 33330–33342 (2022)

  60. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920 (2015)

  61. Xie, Q., et al.: VENet: voting enhancement network for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3712–3721 (2021)

  62. Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)

  63. Xue, L., et al.: ULIP: learning a unified representation of language, images, and point clouds for 3D understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1179–1189 (2023)

  64. Xue, L., et al.: ULIP-2: towards scalable multimodal pre-training for 3D understanding. arXiv preprint arXiv:2305.08275 (2023)

  65. Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N.: Learning object bounding boxes for 3D instance segmentation on point clouds. Adv. Neural Inf. Process. Syst. 32 (2019)

  66. Yang, Y.Q., et al.: Swin3D: a pretrained transformer backbone for 3D indoor scene understanding. arXiv preprint arXiv:2304.06906 (2023)

  67. Yang, Z., Jiang, L., Sun, Y., Schiele, B., Jia, J.: A unified query-based paradigm for point cloud understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8541–8551 (2022)

  68. Yang, Z., et al.: UniTAB: unifying text and box outputs for grounded vision-language modeling. In: European Conference on Computer Vision, pp. 521–539. Springer (2022). https://doi.org/10.1007/978-3-031-20059-5_30

  69. Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.J.: GSPN: generative shape proposal network for 3D instance segmentation in point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3947–3956 (2019)

  70. Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)

  71. Zhang, R., et al.: PointCLIP: point cloud understanding by CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8552–8562 (2022)

  72. Zhao, L., Cai, D., Sheng, L., Xu, D.: 3DVG-Transformer: relation modeling for visual grounding on point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2928–2937 (2021)

  73. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3D: a large photo-realistic dataset for structured 3D modeling. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pp. 519–535. Springer (2020). https://doi.org/10.1007/978-3-030-58545-7_30

  74. Zhong, M., Chen, X., Chen, X., Zeng, G., Wang, Y.: MaskGroup: hierarchical point grouping and masking for 3D instance segmentation. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)

  75. Zhu, X., et al.: PointCLIP v2: prompting CLIP and GPT for powerful 3D open-world learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2639–2650 (2023)

  76. Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3D-VisTA: pre-trained transformer for 3D vision and text alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2911–2921 (2023)

  77. Zou, X., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15116–15127 (2023)


Author information


Corresponding author

Correspondence to Xiang Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5273 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, X., Ding, J., Chen, Z., Elhoseiny, M. (2025). Uni3DL: A Unified Model for 3D Vision-Language Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_5


  • DOI: https://doi.org/10.1007/978-3-031-73337-6_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73336-9

  • Online ISBN: 978-3-031-73337-6

  • eBook Packages: Computer Science, Computer Science (R0)
