
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only representations that are common across modalities, ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR integrates complementary information across modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth) and demonstrate consistent improvements regardless of architecture, in both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work provides valuable insights and raises further interest in the hidden relationships among multimodal representations (https://github.com/zhu-xlab/DeCUR).
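
To make the redundancy-reduction idea above concrete, here is a minimal sketch of an inter-modal loss that decouples common and unique embedding dimensions, in the Barlow Twins-style cross-correlation formulation the abstract alludes to. The dimension split, the `lam` weight, and all function names are illustrative assumptions rather than the authors' implementation; see https://github.com/zhu-xlab/DeCUR for the official code.

```python
# Hedged sketch of decoupled redundancy reduction across two modalities.
# Assumption: projector outputs are split so the first d_common dimensions
# carry cross-modal ("common") information and the rest are modality-unique.
import torch


def cross_corr(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """D x D cross-correlation of two batch-standardized embeddings."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    return (z1.T @ z2) / z1.shape[0]


def off_diagonal(c: torch.Tensor) -> torch.Tensor:
    """Flattened view of all off-diagonal elements of a square matrix."""
    n = c.shape[0]
    return c.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()


def decur_inter_loss(za, zb, d_common: int, lam: float = 5e-3) -> torch.Tensor:
    """Inter-modal loss: align common dims, decorrelate unique dims."""
    c = cross_corr(za, zb)
    cc = c[:d_common, :d_common]  # common-vs-common block
    uu = c[d_common:, d_common:]  # unique-vs-unique block
    # Common dims: pull the cross-modal correlation diagonal towards 1.
    common = (torch.diagonal(cc) - 1).pow(2).sum() \
        + lam * off_diagonal(cc).pow(2).sum()
    # Unique dims: push the diagonal towards 0, leaving these dimensions
    # free to encode modality-specific information.
    unique = torch.diagonal(uu).pow(2).sum() \
        + lam * off_diagonal(uu).pow(2).sum()
    return common + unique


# Example: 512-d projections where the last 64 dims are modality-unique.
za, zb = torch.randn(256, 512), torch.randn(256, 512)
loss = decur_inter_loss(za, zb, d_common=448)
```

In the full method, an intra-modal term (a standard redundancy-reduction loss between two augmented views of the same modality, as the abstract's emphasis on intra-modal training suggests) would accompany this inter-modal loss, keeping the unique dimensions informative rather than letting them collapse.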



Acknowledgement

The main work of Y. Wang, C. Liu, and C. Albrecht was funded by the Helmholtz Association through the Framework of Helmholtz AI, grant ID: ZT-I-PF-5-01 - Local Unit Munich Unit @Aeronautics, Space and Transport (MASTr). The compute was supported by the Helmholtz Association’s Initiative and Networking Fund on the HAICORE@FZJ partition. The work of N. Ait Ali Braham was supported by the European Commission through the project “EvoLand” under the Horizon 2020 Research and Innovation program (Grant Agreement No. 101082130). The work of X. Zhu was supported by the German Federal Ministry of Education and Research (BMBF) in the framework of the international future AI lab “AI4EO – Artificial Intelligence for Earth Observation: Reasoning, Uncertainties, Ethics and Beyond” (grant number: 01DD20001) and by the Munich Center for Machine Learning. The work of Z. Xiong was supported by the German Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection (BMUV) based on a resolution of the German Bundestag (grant number: 67KI32002B; Acronym: EKAPEx). Y. Wang’s work on rebuttal and camera-ready paper preparation was supported by the European Commission through the project “ThinkingEarth-Copernicus Foundation Models for a Thinking Earth” under the Horizon 2020 Research and Innovation program (Grant Agreement No. 101130544).

Author information

Corresponding author

Correspondence to Yi Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 401 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X. (2025). Decoupling Common and Unique Representations for Multimodal Self-supervised Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_17

  • DOI: https://doi.org/10.1007/978-3-031-73397-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73396-3

  • Online ISBN: 978-3-031-73397-0

  • eBook Packages: Computer Science, Computer Science (R0)
