Abstract
Labeling every point in a scene is a laborious task for 3D understanding. To achieve annotation-free training, existing works introduce Contrastive Language-Image Pre-training (CLIP) to transfer its pre-trained visual-linguistic correspondence to 3D-linguistic matching. However, directly adopting this CLIP-driven strategy inevitably introduces bias: the overrated roles of color and texture in an RGB image can overshadow the geometric nature of the corresponding 3D scene, resulting in sub-optimal alignment. We note that, unlike RGB images, a depth map contains rich geometric information. Inspired by this, we propose Depth-Enhanced Alignment (D-EA) for label-free 3D semantic segmentation. D-EA exploits the rich geometric cues in depth maps to mitigate the color and texture biases rooted in the original CLIP-driven strategy. Specifically, we first tune a geometry-enhanced CLIP by aligning its depth predictions to the paired RGB predictions given by the original CLIP. Next, the point cloud feature space is matched with the RGB-Depth aggregated CLIP space by aligning point predictions to both the RGB and depth predictions. Moreover, to mitigate the semantic ambiguity caused by view-specific noise, we propose a View-Integrated Pseudo Label Generation paradigm. Experiments demonstrate the effectiveness of the proposed D-EA on the ScanNet (indoor) and GraspNet-1Billion (desktop) datasets in the label-free setting. Our method is also competitive in limited-annotation semantic segmentation.
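To make the two alignment stages in the abstract concrete, the following is a minimal sketch in PyTorch-style Python. The abstract does not specify the exact losses or interfaces, so all function names, the choice of a KL-divergence alignment loss, and the 0.5 aggregation weight are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of the two alignment stages described in the abstract.
# All names, loss choices, and tensor shapes are illustrative assumptions;
# the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def kl_align(student_logits, teacher_logits, tau=1.0):
    """Align a student's class distribution to a frozen teacher's distribution."""
    teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Stage 1 (assumed): tune a geometry-enhanced (depth) CLIP branch so that its
# per-pixel class predictions match the frozen RGB-CLIP predictions on paired views.
def stage1_depth_to_rgb_loss(depth_clip_logits, rgb_clip_logits):
    return kl_align(depth_clip_logits, rgb_clip_logits)

# Stage 2 (assumed): align 3D point predictions to the aggregated RGB and depth
# CLIP predictions projected onto the same points.
def stage2_point_to_2d_loss(point_logits, rgb_clip_logits, depth_clip_logits, w=0.5):
    return (w * kl_align(point_logits, rgb_clip_logits)
            + (1.0 - w) * kl_align(point_logits, depth_clip_logits))
```

Here `*_logits` would be per-pixel or per-point class logits over the same label set; in practice the 2D predictions must first be projected to the corresponding 3D points before Stage 2 is applied.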
Notes
1. Under the assumption of CLIP, we use ‘text feature space’, ‘linguistic feature space’, and ‘semantic feature space’ interchangeably.
Acknowledgements
This work was supported in part by the Guangdong NSF Project (No. 2023B1515040025).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xie, S., Feng, J., Chen, Z., Liu, Z., Zheng, WS. (2025). Depth-Enhanced Alignment for Label-Free 3D Semantic Segmentation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15318. Springer, Cham. https://doi.org/10.1007/978-3-031-78456-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78455-2
Online ISBN: 978-3-031-78456-9