Depth-Enhanced Alignment for Label-Free 3D Semantic Segmentation

  • Conference paper: Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15318)


Abstract

Labeling every point in a scene is laborious, which hinders 3D understanding. To achieve annotation-free training, existing works introduce Contrastive Language-Image Pre-training (CLIP) to transfer its pre-trained visual-linguistic correspondence to 3D-linguistic matching. However, directly adopting this CLIP-driven strategy inevitably introduces bias: the dominant color and texture cues of an RGB image can overshadow the geometric nature of the corresponding 3D scene, resulting in sub-optimal alignment. We note that, unlike RGB images, a depth map contains rich geometric information. Inspired by this, we propose Depth-Enhanced Alignment (D-EA) for label-free 3D semantic segmentation. D-EA exploits the rich geometric cues in depth maps to mitigate the color and texture biases rooted in the original CLIP-driven strategy. Specifically, we first tune a geometry-enhanced CLIP by aligning its depth prediction to the paired RGB prediction given by the original CLIP. Next, the point cloud feature space is matched with the RGB-depth aggregated CLIP space by aligning the point prediction to the RGB and depth predictions. Moreover, to mitigate the semantic ambiguity caused by view-specific noise, we propose a View-Integrated Pseudo Label Generation paradigm. Experiments demonstrate the effectiveness of D-EA on the ScanNet (indoor) and GraspNet-1Billion (desktop) datasets in the label-free setting. Our method is also competitive in limited-annotation semantic segmentation.
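
The alignment steps described in the abstract can be read as two distillation-style objectives. The sketch below is not the authors' implementation; it is a minimal illustration, assuming dense per-pixel CLIP features for the RGB and depth views, a point backbone producing one feature per 3D point, CLIP text embeddings for the class names, and a pixel-to-point projection index. All function and variable names (clip_logits, rgb_feats, depth_feats, point_feats, text_embeds, pix2point, view_mask) are hypothetical.

import torch
import torch.nn.functional as F

def clip_logits(feats, text_embeds, tau=0.07):
    # Cosine-similarity logits between dense features (N, D) and
    # CLIP text embeddings (C, D), scaled by a temperature.
    feats = F.normalize(feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return feats @ text_embeds.t() / tau

def depth_to_rgb_alignment(depth_feats, rgb_feats, text_embeds):
    # Step 1 (geometry-enhanced CLIP): push the depth branch's per-pixel class
    # distribution toward the frozen RGB CLIP prediction at the same pixels.
    p_rgb = clip_logits(rgb_feats, text_embeds).softmax(dim=-1).detach()
    logp_depth = clip_logits(depth_feats, text_embeds).log_softmax(dim=-1)
    return F.kl_div(logp_depth, p_rgb, reduction="batchmean")

def point_to_2d_alignment(point_feats, rgb_feats, depth_feats, text_embeds, pix2point):
    # Step 2 (3D-to-2D alignment): align each point's prediction to the averaged
    # RGB and depth predictions of the pixels it projects to
    # (pix2point maps each pixel to the index of its corresponding point).
    p_2d = 0.5 * (clip_logits(rgb_feats, text_embeds).softmax(dim=-1)
                  + clip_logits(depth_feats, text_embeds).softmax(dim=-1)).detach()
    logp_point = clip_logits(point_feats, text_embeds).log_softmax(dim=-1)
    return F.kl_div(logp_point[pix2point], p_2d, reduction="batchmean")

def view_integrated_pseudo_labels(per_view_logits, view_mask):
    # One simple reading of View-Integrated Pseudo Label Generation: average
    # class probabilities over all views (V) observing each point (P) before
    # taking the argmax, so that single-view noise is suppressed.
    probs = per_view_logits.softmax(dim=-1) * view_mask.unsqueeze(-1)   # (V, P, C)
    counts = view_mask.sum(dim=0).clamp(min=1).unsqueeze(-1)            # (P, 1)
    return (probs.sum(dim=0) / counts).argmax(dim=-1)                   # (P,)

KL divergence between softened class distributions is only one plausible choice of alignment objective; the paper's actual losses, feature aggregation, and projection handling may differ.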

Notes

  1. Under the assumption of CLIP, we use ‘text feature space’, ‘linguistic feature space’, and ‘semantic feature space’ interchangeably.

Acknowledgements

This work was supported partially by the Guangdong NSF Project (No. 2023B1515040025).

Author information

Corresponding author

Correspondence to Wei-Shi Zheng.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xie, S., Feng, J., Chen, Z., Liu, Z., Zheng, WS. (2025). Depth-Enhanced Alignment for Label-Free 3D Semantic Segmentation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15318. Springer, Cham. https://doi.org/10.1007/978-3-031-78456-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-78456-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78455-2

  • Online ISBN: 978-3-031-78456-9

  • eBook Packages: Computer Science, Computer Science (R0)
