DOI: 10.1145/3581783.3612320
research-article

Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation

Published: 27 October 2023

Abstract

In recent years, cross-modal domain adaptation has been studied on paired 2D image and 3D LiDAR data to ease the labeling costs of 3D LiDAR semantic segmentation (3DLSS) in the target domain. However, in such a setting the paired 2D and 3D data in the source domain must still be collected with additional effort. Since 2D-3D projections enable the 3D model to learn semantic information from its 2D counterpart, we ask whether we can further remove the need for source 3D data and rely only on source 2D images. To answer this question, this paper studies a new 3DLSS setting in which a 2D dataset (source) with semantic annotations and paired but unannotated 2D image and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL). Specifically, CoMoDaL models 1) inter-modal cross-domain distillation between the unpaired source 2D images and target 3D LiDAR data, and 2) intra-domain cross-modal guidance between the paired target 2D image and 3D LiDAR data. In CoMoDaL, we apply several constraints, such as point-to-pixel and prototype-to-pixel alignments, to associate the semantics of different modalities and domains by constructing mixed samples in the two modalities. Experimental results on several datasets show that, in the proposed setting, CoMoDaL can achieve segmentation without the supervision of labeled LiDAR data. Ablations are also conducted to provide further analysis. Code will be made publicly available.
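
The 2D-3D projection and point-to-pixel alignment mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pinhole camera model, the KL-divergence form of the alignment loss (2D prediction as the target), and all function and parameter names (`project_points`, `point_to_pixel_kl`, `K`, `T_cam_from_lidar`) are assumptions made for illustration only.

```python
import numpy as np


def project_points(points_xyz, K, T_cam_from_lidar, image_hw):
    """Project LiDAR points (N, 3) into an image with an assumed pinhole model.

    Returns integer pixel rows, columns, and a mask of points that fall
    in front of the camera and inside the image bounds.
    """
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]                    # camera frame
    in_front = cam[:, 2] > 1e-6
    z = np.where(in_front, cam[:, 2], 1.0)  # dummy depth avoids division by zero
    uvw = (K @ cam.T).T
    cols = np.floor(uvw[:, 0] / z).astype(int)
    rows = np.floor(uvw[:, 1] / z).astype(int)
    h, w = image_hw
    valid = in_front & (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    return rows, cols, valid


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def point_to_pixel_kl(point_logits, pixel_logits_hwc, rows, cols, valid, eps=1e-8):
    """Mean KL(2D || 3D) over points that project inside the image.

    Treating the 2D prediction as the target is an assumption of this sketch.
    """
    if not valid.any():
        return 0.0
    p2d = softmax(pixel_logits_hwc[rows[valid], cols[valid]])  # (M, C)
    p3d = softmax(point_logits[valid])                         # (M, C)
    kl = (p2d * (np.log(p2d + eps) - np.log(p3d + eps))).sum(axis=1)
    return float(kl.mean())


# Toy usage: three points, one of them behind the camera.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
T = np.eye(4)  # LiDAR and camera frames coincide in this toy example
points = np.array([[0.0, 0.0, 10.0], [1.0, 0.5, 5.0], [0.0, 0.0, -3.0]])
rows, cols, valid = project_points(points, K, T, (48, 64))

rng = np.random.default_rng(0)
pixel_logits = rng.normal(size=(48, 64, 4))  # stand-in for 2D network output
point_logits = rng.normal(size=(3, 4))       # stand-in for 3D network output
loss = point_to_pixel_kl(point_logits, pixel_logits, rows, cols, valid)
```

In this sketch only the points that project inside the image contribute to the loss; how the method itself handles unprojected points, mixed samples, or prototypes is not specified by the abstract and is not modeled here.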

Supplemental Material

MP4 File
In this video, we will introduce the background, motivation, method, experimental results, and conclusion of our paper. Our work mainly involves LiDAR semantic segmentation, unsupervised segmentation, cross-modal learning, and cross-domain learning.



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-domain learning
    2. cross-modal learning
    3. lidar semantic segmentation
    4. unsupervised segmentation

    Qualifiers

    • Research-article

    Funding Sources

    • NSFC
    • Guangdong Basic and Applied Basic Research Foundation
    • Guangdong Provincial Key Laboratory of Human Digital Twin
    • CAAI-Huawei MindSpore Open Fund

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
