MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning

Nedungadi, Vishal; Kariryaa, Ankit; Oehmcke, Stefan; Belongie, Serge; Igel, Christian; Lang, Nico

doi:10.1007/978-3-031-73039-9_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15122)

Included in the following conference series:

European Conference on Computer Vision

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks, including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves the linear probing performance compared to pretraining on optical satellite images only. This also leads to better label efficiency and parameter efficiency, which are crucial in global-scale applications. (The MMEarth dataset is available on the project page: vishalned.github.io/mmearth. The dataset construction code is available here: github.com/vishalned/MMEarth-data. The MP-MAE code for training and evaluation is available here: github.com/vishalned/MMEarth-train.)
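To make the multi-pretext idea concrete, the following is a minimal sketch of how losses from several pretext tasks could be combined under a shared patch mask. It is not the authors' MP-MAE implementation: the function names, the modality names (`optical`, `elevation`), the mean-squared-error loss, the equal task weights, and the choice to score only masked patches are all illustrative assumptions that the abstract does not specify.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    num_masked = round(num_patches * mask_ratio)
    masked = set(random.Random(seed).sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (one scalar per patch here)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors)


def multi_pretext_loss(preds, targets, mask, weights=None):
    """Weighted sum of per-modality masked losses (equal weights by default)."""
    weights = weights or {name: 1.0 for name in targets}
    per_task = {
        name: masked_mse(preds[name], targets[name], mask) for name in targets
    }
    total = sum(weights[name] * per_task[name] for name in per_task)
    return total, per_task


# Toy example with two hypothetical pretext targets and 8 patches.
mask = random_patch_mask(num_patches=8, mask_ratio=0.5)
targets = {"optical": [0.0] * 8, "elevation": [1.0] * 8}
preds = {"optical": [0.0] * 8, "elevation": [3.0] * 8}
total, per_task = multi_pretext_loss(preds, targets, mask)
print(per_task, total)
```

With these toy values the optical prediction is exact and the elevation prediction is off by 2 on every patch, so the per-task losses are 0.0 and 4.0 and the combined loss is 4.0.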

Notes

1. We use the latest GEO-Bench version v1.0 in which datasets were class-balanced.

Acknowledgments

We thank Lucia Gordon for the valuable feedback. We greatly appreciate the open data policies of the Copernicus program and its partners ESA and ECMWF. We thank Google Earth Engine for hosting the data and providing free access. This work was supported in part by the Pioneer Centre for AI, DNRF grant number P1. The authors AK, CI, and NL acknowledge support by the research grant DeReEco (grant number 34306) from Villum Foundation. SO and CI acknowledge support by the research grant Global Wetland Center (grant number NNF23OC0081089) from Novo Nordisk Foundation. CI and SB acknowledge support by the European Union project ELIAS (grant agreement number 101120237). We thank the Danish e-Infrastructure Consortium (DeiC), Martin Brandt, and Konrad Schindler for their support with computing resources.

Author information

Authors and Affiliations

University of Copenhagen, Copenhagen, Denmark
Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel & Nico Lang

Corresponding author

Correspondence to Nico Lang.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Cite this paper

Nedungadi, V., Kariryaa, A., Oehmcke, S., Belongie, S., Igel, C., Lang, N. (2025). MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_10

DOI: https://doi.org/10.1007/978-3-031-73039-9_10
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73038-2
Online ISBN: 978-3-031-73039-9
eBook Packages: Computer Science, Computer Science (R0)
