
Uni3DL: A Unified Model for 3D Vision-Language Understanding

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present Uni3DL, a unified model for 3D Vision-Language understanding. Distinct from existing unified 3D vision-language models that mostly rely on projected multi-view images and support limited tasks, Uni3DL operates directly on point clouds and significantly broadens the spectrum of tasks in the 3D domain, encompassing both vision and vision-language tasks. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively produce task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D vision-language understanding. Project page: https://uni3dl.github.io/.
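To make the described decomposition concrete, below is a minimal, illustrative PyTorch sketch of the two components named in the abstract: a set of learnable queries cross-attending to point-cloud features to produce task-agnostic semantic and mask outputs, and a task router that dispatches the shared query features to task-specific heads. All module names, dimensions, and the particular set of heads are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a minimal PyTorch rendition of the high-level design
# described in the abstract (learnable queries cross-attending to point features,
# followed by a task router that selects task-specific heads). Module names,
# sizes, and the set of heads are assumptions, not the released Uni3DL code.
import torch
import torch.nn as nn


class QueryTransformer(nn.Module):
    """Learnable queries attend to per-point features and yield task-agnostic outputs."""

    def __init__(self, num_queries=100, d_model=256, num_layers=4, num_classes=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.class_head = nn.Linear(d_model, num_classes)  # semantic logits per query
        self.mask_proj = nn.Linear(d_model, d_model)        # query embedding used for mask dot-product

    def forward(self, point_feats):                          # point_feats: (B, N_points, d_model)
        q = self.queries.weight.unsqueeze(0).expand(point_feats.size(0), -1, -1)
        q = self.decoder(q, point_feats)                     # (B, N_queries, d_model)
        class_logits = self.class_head(q)                    # (B, N_queries, num_classes)
        mask_logits = torch.einsum("bqd,bnd->bqn", self.mask_proj(q), point_feats)
        return q, class_logits, mask_logits


class TaskRouter(nn.Module):
    """Dispatches the shared query features to task-specific heads (assumed task set)."""

    def __init__(self, d_model=256, vocab_size=30522):
        super().__init__()
        self.heads = nn.ModuleDict({
            "grounding": nn.Linear(d_model, 1),          # per-query matching score against a text query
            "retrieval": nn.Linear(d_model, d_model),    # embedding for text-3D cross-modal retrieval
            "caption":   nn.Linear(d_model, vocab_size), # token logits for 3D captioning
        })

    def forward(self, task, query_feats):
        return self.heads[task](query_feats)


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 256)                     # stand-in for backbone point features
    qt, router = QueryTransformer(), TaskRouter()
    q, cls_logits, mask_logits = qt(feats)
    print(cls_logits.shape, mask_logits.shape)            # (2, 100, 20) (2, 100, 1024)
    print(router("grounding", q).shape)                   # (2, 100, 1)
```

The point of the sketch is the parameter sharing: the query transformer is reused for every task, and only the small routed heads are task-specific.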

X. Li and J. Ding—Equal contribution.

Z. Chen: This work was done when Zhaoyang Chen was an intern at KAUST.


Notes

  1. To ensure a fair comparison with PointLLM, we filter out the 200 objects used for benchmark evaluation from our training set and report performance on the same 200 objects.
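The filtering in this footnote amounts to a set-difference over object identifiers: remove the benchmark objects from the training set, then evaluate on exactly that held-out set. A minimal sketch follows; the record layout and ID scheme are purely hypothetical, not the actual data format.

```python
# Hypothetical sketch of the footnote's protocol: exclude the 200 objects used
# for the PointLLM benchmark from training and evaluate on exactly those objects.
# Record layout and IDs are illustrative assumptions, not the real datasets.
full_corpus = [{"id": f"obj_{i:04d}"} for i in range(1000)]   # stand-in object records
benchmark_ids = {f"obj_{i:04d}" for i in range(200)}           # the 200 held-out evaluation objects

train_set = [o for o in full_corpus if o["id"] not in benchmark_ids]
eval_set = [o for o in full_corpus if o["id"] in benchmark_ids]

assert len(eval_set) == 200
assert benchmark_ids.isdisjoint(o["id"] for o in train_set)   # no evaluation object seen in training
```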

References

  1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022)

  2. Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543 (2016)

  3. Baez, J., Huerta, J.: The algebra of grand unified theories. Bull. Am. Math. Soc. 47(3), 483–552 (2010)

  4. Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)

  5. Cai, D., Zhao, L., Zhang, J., Sheng, L., Xu, D.: 3DJCG: a unified framework for joint dense captioning and visual grounding on 3D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16464–16473 (2022)

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  7. Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: European Conference on Computer Vision, pp. 202–221. Springer (2020). https://doi.org/10.1007/978-3-030-58565-5_13

  8. Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  9. Chen, K., Choy, C.B., Savva, M., Chang, A.X., Funkhouser, T., Savarese, S.: Text2Shape: generating shapes from natural language by learning joint embeddings. In: Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pp. 100–116. Springer (2019). https://doi.org/10.1007/978-3-030-20893-6_7

  10. Chen, S., Fang, J., Zhang, Q., Liu, W., Wang, X.: Hierarchical aggregation for 3D instance segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15447–15456 (2021). https://doi.org/10.1109/ICCV48922.2021.01518

  11. Chen, T., Saxena, S., Li, L., Lin, T.Y., Fleet, D.J., Hinton, G.E.: A unified sequence interface for vision tasks. Adv. Neural Inf. Process. Syst. 35, 31333–31346 (2022)

  12. Chen, Z., Hu, R., Chen, X., Nießner, M., Chang, A.X.: UniT3D: a unified transformer for 3D dense captioning and visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18109–18119 (2023)

  13. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1290–1299 (2022)

  14. Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 34, 17864–17875 (2021)

  15. Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org. Accessed 14 April 2023 (2023)

  16. Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084 (2019)

  17. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)

  18. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning (2023)

  19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  20. Guo, Z., et al.: Point-Bind & Point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)

  21. Han, L., Zheng, T., Xu, L., Fang, L.: OccuSeg: occupancy-aware 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2940–2949 (2020)

  22. Han, Z., Shang, M., Wang, X., Liu, Y.S., Zwicker, M.: Y2Seq2Seq: cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 126–133 (2019)

  23. Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3D-LLM: injecting the 3D world into large language models. arXiv preprint arXiv:2307.12981 (2023)

  24. Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4430 (2019)

  25. Huang, P.H., Lee, H.H., Chen, H.T., Liu, T.L.: Text-guided graph neural networks for referring 3D instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1610–1618 (2021)

  26. Huang, T., et al.: CLIP2Point: transfer CLIP to point cloud classification with image-depth pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22157–22167 (2023)

  27. Ilharco, G., et al.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773

  28. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)

  29. Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: PointGroup: dual-set point grouping for 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4867–4876 (2020)

  30. Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., Qi, X., Jia, J.: Stratified transformer for 3D point cloud segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8500–8509 (2022)

  31. Lai, X., Yuan, Y., Chu, R., Chen, Y., Hu, H., Jia, J.: Mask-attention-free transformer for 3D instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3693–3703 (2023)

  32. Langacker, P.: Grand unified theories and proton decay. Phys. Rep. 72(4), 185–385 (1981)

  33. Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J.: Multimodal foundation models: from specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)

  34. Li, H., et al.: Uni-Perceiver v2: a generalist model for large-scale vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2691–2700 (2023)

  35. Liang, Z., Li, Z., Xu, S., Tan, M., Jia, K.: Instance segmentation in 3D scenes using semantic superpoint tree networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2783–2792 (2021)

  36. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)

  37. Liu, S.H., Yu, S.Y., Wu, S.C., Chen, H.T., Liu, T.L.: Learning Gaussian instance segmentation in point clouds. arXiv preprint arXiv:2007.09860 (2020)

  38. Locatello, F., et al.: Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525–11538 (2020)

  39. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  40. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: a unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)

  41. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3D captioning with pretrained models. arXiv preprint arXiv:2306.07279 (2023)

  42. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)

  43. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  44. Park, C., Jeong, Y., Cho, M., Park, J.: Fast point transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16949–16958 (2022)

  45. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  46. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  47. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30 (2017)

  48. Qian, G., et al.: PointNeXt: revisiting PointNet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 35, 23192–23204 (2022)

  49. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  50. Radford, A., et al.: Improving language understanding by generative pre-training (2018)

  51. Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3D: mask transformer for 3D semantic instance segmentation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 8216–8223. IEEE (2023)

  52. Sun, J., Qing, C., Tan, J., Xu, X.: Superpoint transformer for 3D scene instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2393–2401 (2023)

  53. Tang, C., Yang, X., Wu, B., Han, Z., Chang, Y.: Parts2Words: learning joint embedding of point clouds and texts by bidirectional matching between parts and words. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6884–6893 (2023)

  54. Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D.: SoftGroup for 3D instance segmentation on point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2708–2717 (2022)

  55. Wang, H., et al.: CAGroup3D: class-aware grouping for 3D object detection on point clouds. Adv. Neural Inf. Process. Syst. 35, 29975–29988 (2022)

  56. Wang, J., et al.: GIT: a generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)

  57. Wang, P., et al.: OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning, pp. 23318–23340. PMLR (2022)

  58. Wang, W., et al.: VisionLLM: large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175 (2023)

  59. Wu, X., Lao, Y., Jiang, L., Liu, X., Zhao, H.: Point transformer v2: grouped vector attention and partition-based pooling. Adv. Neural Inf. Process. Syst. 35, 33330–33342 (2022)

  60. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920 (2015)

  61. Xie, Q., et al.: VENet: voting enhancement network for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3712–3721 (2021)

  62. Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)

  63. Xue, L., et al.: ULIP: learning a unified representation of language, images, and point clouds for 3D understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1179–1189 (2023)

  64. Xue, L., et al.: ULIP-2: towards scalable multimodal pre-training for 3D understanding. arXiv preprint arXiv:2305.08275 (2023)

  65. Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N.: Learning object bounding boxes for 3D instance segmentation on point clouds. Adv. Neural Inf. Process. Syst. 32 (2019)

  66. Yang, Y.Q., et al.: Swin3D: a pretrained transformer backbone for 3D indoor scene understanding. arXiv preprint arXiv:2304.06906 (2023)

  67. Yang, Z., Jiang, L., Sun, Y., Schiele, B., Jia, J.: A unified query-based paradigm for point cloud understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8541–8551 (2022)

  68. Yang, Z., et al.: UniTAB: unifying text and box outputs for grounded vision-language modeling. In: European Conference on Computer Vision, pp. 521–539. Springer (2022). https://doi.org/10.1007/978-3-031-20059-5_30

  69. Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.J.: GSPN: generative shape proposal network for 3D instance segmentation in point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3947–3956 (2019)

  70. Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)

  71. Zhang, R., et al.: PointCLIP: point cloud understanding by CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8552–8562 (2022)

  72. Zhao, L., Cai, D., Sheng, L., Xu, D.: 3DVG-Transformer: relation modeling for visual grounding on point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2928–2937 (2021)

  73. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3D: a large photo-realistic dataset for structured 3D modeling. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pp. 519–535. Springer (2020). https://doi.org/10.1007/978-3-030-58545-7_30

  74. Zhong, M., Chen, X., Chen, X., Zeng, G., Wang, Y.: MaskGroup: hierarchical point grouping and masking for 3D instance segmentation. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)

  75. Zhu, X., et al.: PointCLIP v2: prompting CLIP and GPT for powerful 3D open-world learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2639–2650 (2023)

  76. Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3D-VisTA: pre-trained transformer for 3D vision and text alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2911–2921 (2023)

  77. Zou, X., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15116–15127 (2023)


Author information


Corresponding author

Correspondence to Xiang Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5273 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, X., Ding, J., Chen, Z., Elhoseiny, M. (2025). Uni3DL: A Unified Model for 3D Vision-Language Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_5


  • DOI: https://doi.org/10.1007/978-3-031-73337-6_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73336-9

  • Online ISBN: 978-3-031-73337-6

  • eBook Packages: Computer Science, Computer Science (R0)
