Spatial Cross-Validation for Globally Distributed Data

Beigaitė, Rita; Mechenich, Michael; Žliobaitė, Indrė

doi:10.1007/978-3-031-18840-4_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13601)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Increasing amounts of large-scale georeferenced data produced by Earth observation missions present new challenges for training and testing machine-learned predictive models. Most of this data is spatially auto-correlated, which violates the classical i.i.d. assumption (independent and identically distributed data) commonly used in machine learning. One of the largest challenges in relation to spatial auto-correlation is how to generate testing sets that are sufficiently independent of the training data. In the geoscience and ecological literature, spatially stratified cross-validation is increasingly used as an alternative to standard random cross-validation. Spatial cross-validation, however, is not yet widely studied in the machine learning setting, and theoretical and empirical support is largely lacking. Our study aims at formally introducing spatial cross-validation to the machine learning community. We present experiments on data sets from two different domains (mammalian ecology and agriculture), which include globally distributed multi-target data, and show how standard cross-validation may lead to over-optimistic evaluation. We propose how to use tailored spatial cross-validation in this context to achieve a more realistic assessment of performance and prudent model selection.
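To make the contrast with random cross-validation concrete, the following is a minimal sketch of one generic form of spatial cross-validation: samples are grouped into latitude/longitude grid blocks, and whole blocks are held out together so that test points do not share a block with training points. It is not the tailored procedure proposed in the paper. The function name `spatial_block_folds`, the 10-degree block size, and the number of folds are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def spatial_block_folds(coords, block_deg=10.0, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs that hold out whole grid blocks.

    coords is a list of (lat, lon) pairs in degrees. block_deg and n_folds
    are illustrative defaults, not values taken from the paper.
    """
    # Group sample indices by the grid block they fall into.
    blocks = defaultdict(list)
    for i, (lat, lon) in enumerate(coords):
        key = (math.floor(lat / block_deg), math.floor(lon / block_deg))
        blocks[key].append(i)

    # Shuffle the blocks (not the individual points) and deal them to folds.
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    fold_of_block = {key: j % n_folds for j, key in enumerate(keys)}

    for fold in range(n_folds):
        test = [i for k in keys if fold_of_block[k] == fold for i in blocks[k]]
        train = [i for k in keys if fold_of_block[k] != fold for i in blocks[k]]
        yield sorted(train), sorted(test)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic coordinates, used only to exercise the splitter.
    points = [(rng.uniform(-60, 80), rng.uniform(-180, 180)) for _ in range(100)]
    for train_idx, test_idx in spatial_block_folds(points):
        print(len(train_idx), "train /", len(test_idx), "test")
```

Blocks defined in degrees are not equal in area, and points on either side of a block boundary can still be close to each other, so a sketch like this reduces train/test proximity rather than eliminating it.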

Notes

1. The distance between two georeferenced points can be calculated using the Haversine distance formula, which gives the shortest-path spherical distance between two points from their longitude and latitude coordinates [7].
2. We made the data sets and our code publicly available at https://github.com/ritabei/Spatial-cross-validation.
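The distance calculation mentioned in Note 1 can be sketched as follows. This is a standard implementation of the haversine formula, not code taken from the authors' repository; the function name `haversine_km` and the mean Earth radius constant are assumptions made for illustration.

```python
import math

# Mean Earth radius in kilometres (an assumed constant; the Earth is not a perfect sphere).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 so rounding error cannot push the asin argument out of range.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # A quarter of the equator: from (0, 0) to (0, 90).
    print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

A distance function of this kind is what allows test points to be kept at least a chosen number of kilometres away from training points, rather than relying on differences in raw degrees.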

References

1. Adams, M.D., Massey, F., Chastko, K., Cupini, C.: Spatial modelling of particulate matter air pollution sensor measurements collected by community scientists while cycling, land use regression with spatial cross-validation, and applications of machine learning for data correction. Atmos. Environ. 230, 117479 (2020)
2. Airola, A., et al.: The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers. Data Min. Knowl. Disc. 33(3), 730–747 (2019)
3. Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
4. Bahn, V., McGill, B.J.: Testing the predictive performance of distribution models. Oikos 122(3), 321–331 (2013)
5. Batjes, N.: Harmonized soil profile data for applications at global and continental scales: updates to the WISE database. Soil Use Manag. 25(2), 124–127 (2009)
6. Channan, S., Collins, K., Emanuel, W.: Global mosaics of the standard MODIS land cover type data. University of Maryland and the Pacific Northwest National Laboratory, College Park, Maryland, USA 30 (2014)
7. Chopde, N.R., Nichat, M.: Landmark based shortest path detection by using A* and haversine formula. Int. J. Innov. Res. Comput. Commun. Eng. 1(2), 298–302 (2013)
8. Feluch, W., Koronacki, J.: A note on modified cross-validation in density estimation. Comput. Stat. Data Analysis 13(2), 143–151 (1992)
9. Galbrun, E., Tang, H., Fortelius, M., Žliobaitė, I.: Computational biomes: The ecometrics of large mammal teeth. Palaeontol. Electron. 21(21.1.3A), 1–31 (2018)
10. Getis, A.: A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr. Anal. 40(3), 297–309 (2008)
11. Hijmans, R.J.: Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology 93(3), 679–688 (2012)
12. Karasiak, N., Dejoux, J.-F., Monteil, C., Sheeren, D.: Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Mach. Learn. 111, 1–26 (2021). https://doi.org/10.1007/s10994-021-05972-1
13. Lary, D., et al.: Machine learning applications for earth observation. In: Mathieu, P.-P., Aubrecht, C. (eds.) Earth Observation Open Science and Innovation. ISRS, vol. 15, pp. 165–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-65633-5_8
14. Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., Bretagnolle, V.: Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Glob. Ecol. Biogeogr. 23(7), 811–820 (2014)
15. Meyer, H., Pebesma, E.: Machine learning-based global maps of ecological variables and the challenge of assessing them. Nat. Commun. 13(1), 1–4 (2022)
16. Miller, H.J.: Tobler’s first law and spatial analysis. Ann. Assoc. Am. Geogr. 94(2), 284–289 (2004)
17. Ploton, P., et al.: Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nat. Commun. 11(1), 1–11 (2020)
18. Pohjankukka, J., Pahikkala, T., Nevalainen, P., Heikkonen, J.: Estimating the prediction performance of spatial models via spatial k-fold cross validation. Int. J. Geogr. Inf. Sci. 31(10), 2001–2019 (2017)
19. Roberts, D.R., et al.: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913–929 (2017)
20. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Model. 406, 109–120 (2019)
21. Trachsel, M., Telford, R.J.: Estimating unbiased transfer-function performances in spatially structured environments. Climate of the Past 12(5), 1215–1223 (2016)
22. Valavi, R., Elith, J., Lahoz-Monfort, J.J., Guillera-Arroita, G.: blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol. Evol. 10(2), 225–232 (2019)
23. Wadoux, A.M.C., Heuvelink, G.B., De Bruin, S., Brus, D.J.: Spatial cross-validation is not the right way to evaluate map accuracy. Ecol. Model. 457, 109692 (2021)
24. Žliobaitė, I., et al.: Herbivore teeth predict climatic limits in Kenyan ecosystems. Proc. Natl. Acad. Sci. 113(45), 12751–12756 (2016)

Acknowledgements

We thank Tang Hui for the initial pre-processing of the mammalian ecology data set. Research leading to these results was supported by the Academy of Finland (grants no. 314803 and 341623).

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Rita Beigaitė, Michael Mechenich & Indrė Žliobaitė
Department of Geosciences and Geography, University of Helsinki, Helsinki, Finland
Indrė Žliobaitė

Corresponding author

Correspondence to Rita Beigaitė.

Editor information

Editors and Affiliations

University of Montpellier, Montpellier, France
Poncelet Pascal
INRAE, Montpellier, France
Dino Ienco

About this paper

Cite this paper

Beigaitė, R., Mechenich, M., Žliobaitė, I. (2022). Spatial Cross-Validation for Globally Distributed Data. In: Pascal, P., Ienco, D. (eds) Discovery Science. DS 2022. Lecture Notes in Computer Science (LNAI), vol 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_10

DOI: https://doi.org/10.1007/978-3-031-18840-4_10
Published: 06 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18839-8
Online ISBN: 978-3-031-18840-4
eBook Packages: Computer Science, Computer Science (R0)
