Abstract
Unsupervised domain adaptation (UDA) for 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point clouds. For LiDAR point clouds in particular, the domain discrepancy becomes pronounced across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. Inspired by the remarkable generalization capability exhibited by the vision foundation model SAM in image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains, thereby addressing the 3D domain adaptation problem. Specifically, we harness the images paired with point clouds to facilitate knowledge transfer, and we propose a hybrid feature augmentation method that strengthens the alignment between the 3D feature space and SAM's feature space at both the scene and instance levels. Our method is evaluated on several widely recognized datasets and achieves state-of-the-art performance.
X. Zhu and Y. Ma—This work was supported by NSFC (No.62206173), Natural Science Foundation of Shanghai (No.22dz1201900), Shanghai Sailing Program (No.22YF1428700), MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University), Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI).
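To make the alignment objective described in the abstract concrete, below is a minimal sketch under stated assumptions: a frozen SAM image encoder provides a 2D feature map, each LiDAR point is projected to a pixel via known camera calibration, and the 3D backbone's per-point features (assumed already mapped to SAM's channel dimension by a learned projection head) are pulled toward the sampled SAM features with a cosine distance. All names and shapes here are illustrative assumptions, not the authors' released implementation; in particular, the paper's hybrid scene- and instance-level feature augmentation is not reproduced in this sketch.

```python
# Hedged sketch of cross-modal feature alignment between a 3D backbone
# and a frozen SAM image encoder. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def sam_alignment_loss(point_feats, sam_feats, uv, valid):
    """Pull 3D point features toward SAM image features.

    point_feats: (N, C) per-point features from the 3D segmentation
                 backbone, assumed projected to SAM's channel width C.
    sam_feats:   (C, H, W) feature map from a frozen SAM image encoder.
    uv:          (N, 2) pixel coordinates of each point's projection,
                 normalized to [-1, 1] as required by grid_sample.
    valid:       (N,) bool mask for points that land inside the image.
    """
    # Bilinearly sample the SAM feature at each projected point location.
    grid = uv[valid].view(1, 1, -1, 2)                    # (1, 1, M, 2)
    target = F.grid_sample(sam_feats.unsqueeze(0), grid,
                           align_corners=False)           # (1, C, 1, M)
    target = target.squeeze(0).squeeze(1).t()             # (M, C)
    # Cosine distance aligns the 3D feature space with SAM's space.
    cos = F.cosine_similarity(point_feats[valid], target, dim=-1)
    return (1.0 - cos).mean()
```

In a full pipeline, a loss of this kind would presumably be added to the supervised segmentation loss on the source domain and also applied on the unlabeled target domain, since it needs only point-to-pixel correspondences and no 3D labels.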