Any2Point: Empowering Any-Modality Large Models for Efficient 3D Understanding

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15094)

Included in the following conference series: European Conference on Computer Vision (ECCV)

Abstract

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, promoting the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. The code is released at https://github.com/Ivan-Tang-3D/Any2Point.
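To make the virtual projection concrete: rather than rendering the point cloud into real images, each 3D token center is rotated into a handful of virtual view planes, its depth coordinate is discarded, and the resulting 2D position indexes the frozen model's positional-embedding table; averaging over views gives the token's positional encoding. Below is a minimal sketch of this idea for the 2D case, assuming a frozen 2D transformer with a learned table pos_embed_2d of shape (grid*grid, C) and coordinates normalized to [-1, 1]; the function name virtual_project_2d and all shapes are illustrative assumptions, not the released Any2Point implementation.

    import torch

    def virtual_project_2d(xyz: torch.Tensor,
                           pos_embed_2d: torch.Tensor,
                           grid: int,
                           views: torch.Tensor) -> torch.Tensor:
        """Assign each 3D token a 2D positional encoding without rendering.

        xyz:          (N, 3) 3D token coordinates, normalized to [-1, 1].
        pos_embed_2d: (grid*grid, C) frozen positional table of the 2D model.
        views:        (M, 3, 3) rotation matrices defining M virtual view planes.
        Returns:      (N, C) positional encodings averaged over the M views.
        """
        # Rotate points into each virtual camera frame: (M, N, 3).
        rotated = torch.einsum('mij,nj->mni', views, xyz)
        # Orthographically drop the depth axis to get virtual 2D positions.
        uv = rotated[..., :2]                                # (M, N, 2)
        # Quantize [-1, 1] coordinates to the 2D model's patch grid.
        idx = ((uv + 1.0) * 0.5 * (grid - 1)).round().long().clamp(0, grid - 1)
        flat = idx[..., 1] * grid + idx[..., 0]              # (M, N)
        # Look up frozen 2D positional embeddings and average over views.
        return pos_embed_2d[flat].mean(dim=0)                # (N, C)

    # Example with a single identity view plane and a 14x14 patch grid:
    # pe = virtual_project_2d(xyz, pos_embed_2d, grid=14, views=torch.eye(3)[None])

For a 1D source modality (language or audio), the same recipe would apply with a single projection axis indexing a 1D positional table.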
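The fine-tuning side follows the standard frozen-backbone adapter pattern: the entire pretrained transformer stays frozen, and only small bottleneck modules inserted into each block are trained. The sketch below shows that generic pattern only; the paper's any-to-3D guided adapter additionally aggregates each token's local 3D neighborhood under the 1D/2D positional priors, which is omitted here. Class and helper names are hypothetical.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Down-project -> GELU -> up-project, added residually to the tokens."""
        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)
            # Zero-init the up-projection so training starts from the
            # frozen model's original behavior.
            nn.init.zeros_(self.up.weight)
            nn.init.zeros_(self.up.bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.up(self.act(self.down(x)))

    def add_adapters(blocks: nn.ModuleList, dim: int) -> nn.ModuleList:
        """Freeze every pretrained block; return one trainable adapter per block."""
        for p in blocks.parameters():
            p.requires_grad = False
        return nn.ModuleList(BottleneckAdapter(dim) for _ in blocks)

Only the adapter parameters (plus, typically, a lightweight task head) receive gradients, which is what makes the approach parameter-efficient.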

Y. Tang, R. Zhang, J. Liu, and Z. Guo: Equal contribution.



Acknowledgements

This work is partially supported by the Shanghai AI Laboratory, National Key R&D Program of China (2022ZD0160101), the National Natural Science Foundation of China (62376222), and Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).

Author information


Corresponding author

Correspondence to Dong Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 675 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tang, Y. et al. (2025). Any2Point: Empowering Any-Modality Large Models for Efficient 3D Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_26


  • DOI: https://doi.org/10.1007/978-3-031-72764-1_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72763-4

  • Online ISBN: 978-3-031-72764-1

  • eBook Packages: Computer Science, Computer Science (R0)
