Abstract
Labeling every point in a scene is a laborious task for 3D understanding. To achieve annotation-free training, existing works introduce Contrastive Language-Image Pre-training (CLIP) to transfer its pre-trained visual-linguistic correspondence to 3D-linguistic matching. However, directly adopting this CLIP-driven strategy inevitably introduces bias: the overrated roles of color and texture in an RGB image can overshadow the geometric nature of the corresponding 3D scene, resulting in sub-optimal alignment. We note that, unlike RGB images, a depth map contains rich geometric information. Inspired by this, we propose Depth-Enhanced Alignment (D-EA) for label-free 3D semantic segmentation. D-EA exploits the rich geometric cues in depth maps to mitigate the color and texture biases rooted in the original CLIP-driven strategy. Specifically, we first tune a geometry-enhanced CLIP by aligning its depth predictions to the paired RGB predictions given by the original CLIP. Next, the point cloud feature space is matched with the RGB-Depth aggregated CLIP space by aligning point predictions to both the RGB and depth predictions. Moreover, to mitigate the semantic ambiguity caused by view-specific noise, we propose a View-Integrated Pseudo Label Generation paradigm. Experiments demonstrate the effectiveness of the proposed D-EA on the ScanNet (indoor) and GraspNet-1Billion (desktop) datasets in the label-free setting. Our method is also competitive in limited-annotation semantic segmentation.
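To make the two alignment stages in the abstract concrete, the following is a minimal sketch in PyTorch-style Python. The abstract does not specify the exact losses or interfaces, so all function names, the choice of a KL-divergence alignment loss, and the 0.5 aggregation weight are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of the two alignment stages described in the abstract.
# All names, loss choices, and tensor shapes are illustrative assumptions;
# the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def kl_align(student_logits, teacher_logits, tau=1.0):
    """Align a student's class distribution to a frozen teacher's distribution."""
    teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Stage 1 (assumed): tune a geometry-enhanced (depth) CLIP branch so that its
# per-pixel class predictions match the frozen RGB-CLIP predictions on paired views.
def stage1_depth_to_rgb_loss(depth_clip_logits, rgb_clip_logits):
    return kl_align(depth_clip_logits, rgb_clip_logits)

# Stage 2 (assumed): align 3D point predictions to the aggregated RGB and depth
# CLIP predictions projected onto the same points.
def stage2_point_to_2d_loss(point_logits, rgb_clip_logits, depth_clip_logits, w=0.5):
    return (w * kl_align(point_logits, rgb_clip_logits)
            + (1.0 - w) * kl_align(point_logits, depth_clip_logits))
```

Here `*_logits` would be per-pixel or per-point class logits over the same label set; in practice the 2D predictions must first be projected to the corresponding 3D points before Stage 2 is applied.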
Notes
1. Under the assumption of CLIP, we use ‘text feature space’, ‘linguistic feature space’, and ‘semantic feature space’ interchangeably.
Acknowledgements
This work was supported in part by the Guangdong NSF Project (No. 2023B1515040025).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xie, S., Feng, J., Chen, Z., Liu, Z., Zheng, WS. (2025). Depth-Enhanced Alignment for Label-Free 3D Semantic Segmentation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15318. Springer, Cham. https://doi.org/10.1007/978-3-031-78456-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78455-2
Online ISBN: 978-3-031-78456-9