Abstract
Increasing amounts of large-scale georeferenced data produced by Earth observation missions present new challenges for training and testing machine-learned predictive models. Most of this data is spatially auto-correlated, which violates the classical i.i.d. assumption (independent and identically distributed data) commonly made in machine learning. One of the largest challenges posed by spatial auto-correlation is how to generate testing sets that are sufficiently independent of the training data. In the geoscience and ecological literature, spatially stratified cross-validation is increasingly used as an alternative to standard random cross-validation. Spatial cross-validation, however, has not yet been widely studied in the machine learning setting, and theoretical and empirical support is largely lacking. Our study aims to formally introduce spatial cross-validation to the machine learning community. We present experiments on data sets from two different domains (mammalian ecology and agriculture), which include globally distributed multi-target data, and show how standard cross-validation may lead to over-optimistic evaluation. We propose how to use tailored spatial cross-validation in this context to achieve a more realistic assessment of performance and more prudent model selection.
Notes
1. The distance between two georeferenced points can be calculated using the Haversine formula, which gives the shortest-path (great-circle) distance on a sphere between two points from their longitude and latitude coordinates [7].
2. We made the data sets and our code publicly available at https://github.com/ritabei/Spatial-cross-validation.
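The Haversine distance mentioned in note 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `haversine_km`, the argument order, and the mean Earth radius constant are assumptions made here, and the Earth is treated as a perfect sphere.

```python
import math

# Mean Earth radius in kilometres (assumed value; the paper does not state one).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees.

    Hypothetical helper for illustration only.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 to guard against floating-point overshoot before sqrt/asin.
    central_angle = 2 * math.asin(math.sqrt(min(1.0, a)))
    return EARTH_RADIUS_KM * central_angle


# Example: two points on the equator, 90 degrees of longitude apart,
# are a quarter of a great circle from each other.
quarter_circle_km = haversine_km(0.0, 0.0, 0.0, 90.0)
```

A distance of this kind could be used, for instance, to check that test points lie at least some minimum distance from every training point; that threshold would be a modelling choice and is not specified here.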
References
Adams, M.D., Massey, F., Chastko, K., Cupini, C.: Spatial modelling of particulate matter air pollution sensor measurements collected by community scientists while cycling, land use regression with spatial cross-validation, and applications of machine learning for data correction. Atmos. Environ. 230, 117479 (2020)
Airola, A., et al.: The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers. Data Min. Knowl. Disc. 33(3), 730–747 (2019)
Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
Bahn, V., McGill, B.J.: Testing the predictive performance of distribution models. Oikos 122(3), 321–331 (2013)
Batjes, N.: Harmonized soil profile data for applications at global and continental scales: updates to the WISE database. Soil Use Manag. 25(2), 124–127 (2009)
Channan, S., Collins, K., Emanuel, W.: Global mosaics of the standard MODIS land cover type data. University of Maryland and the Pacific Northwest National Laboratory, College Park, Maryland, USA 30 (2014)
Chopde, N.R., Nichat, M.: Landmark based shortest path detection by using A* and haversine formula. Int. J. Innov. Res. Comput. Commun. Eng. 1(2), 298–302 (2013)
Feluch, W., Koronacki, J.: A note on modified cross-validation in density estimation. Comput. Stat. Data Analysis 13(2), 143–151 (1992)
Galbrun, E., Tang, H., Fortelius, M., Žliobaitė, I.: Computational biomes: The ecometrics of large mammal teeth. Palaeontol. Electron. 21.1.3A, 1–31 (2018)
Getis, A.: A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr. Anal. 40(3), 297–309 (2008)
Hijmans, R.J.: Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology 93(3), 679–688 (2012)
Karasiak, N., Dejoux, J.-F., Monteil, C., Sheeren, D.: Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Mach. Learn. 111, 1–26 (2021). https://doi.org/10.1007/s10994-021-05972-1
Lary, D., et al.: Machine learning applications for earth observation. In: Mathieu, P.-P., Aubrecht, C. (eds.) Earth Observation Open Science and Innovation. ISRS, vol. 15, pp. 165–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-65633-5_8
Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., Bretagnolle, V.: Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Glob. Ecol. Biogeogr. 23(7), 811–820 (2014)
Meyer, H., Pebesma, E.: Machine learning-based global maps of ecological variables and the challenge of assessing them. Nat. Commun. 13(1), 1–4 (2022)
Miller, H.J.: Tobler’s first law and spatial analysis. Ann. Assoc. Am. Geogr. 94(2), 284–289 (2004)
Ploton, P., et al.: Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nat. Commun. 11(1), 1–11 (2020)
Pohjankukka, J., Pahikkala, T., Nevalainen, P., Heikkonen, J.: Estimating the prediction performance of spatial models via spatial k-fold cross validation. Int. J. Geogr. Inf. Sci. 31(10), 2001–2019 (2017)
Roberts, D.R., et al.: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913–929 (2017)
Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Model. 406, 109–120 (2019)
Trachsel, M., Telford, R.J.: Estimating unbiased transfer-function performances in spatially structured environments. Climate of the Past 12(5), 1215–1223 (2016)
Valavi, R., Elith, J., Lahoz-Monfort, J.J., Guillera-Arroita, G.: blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol. Evol. 10(2), 225–232 (2019)
Wadoux, A.M.C., Heuvelink, G.B., De Bruin, S., Brus, D.J.: Spatial cross-validation is not the right way to evaluate map accuracy. Ecol. Model. 457, 109692 (2021)
Žliobaitė, I., et al.: Herbivore teeth predict climatic limits in Kenyan ecosystems. Proc. Natl. Acad. Sci. 113(45), 12751–12756 (2016)
Acknowledgements
We thank Tang Hui for the initial pre-processing of the mammalian ecology data set. Research leading to these results was supported by the Academy of Finland (grants no. 314803 and 341623).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Beigaitė, R., Mechenich, M., Žliobaitė, I. (2022). Spatial Cross-Validation for Globally Distributed Data. In: Pascal, P., Ienco, D. (eds) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_10
DOI: https://doi.org/10.1007/978-3-031-18840-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18839-8
Online ISBN: 978-3-031-18840-4
eBook Packages: Computer Science (R0)