Spatial Cross-Validation for Globally Distributed Data

  • Conference paper
Discovery Science (DS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13601)

Abstract

Increasing amounts of large-scale georeferenced data produced by Earth observation missions present new challenges for training and testing machine-learned predictive models. Most of this data is spatially auto-correlated, which violates the classical i.i.d. assumption (independent and identically distributed data) commonly used in machine learning. One of the main challenges posed by spatial auto-correlation is how to generate test sets that are sufficiently independent of the training data. In the geoscience and ecological literature, spatially stratified cross-validation is increasingly used as an alternative to standard random cross-validation. Spatial cross-validation, however, is not yet widely studied in the machine learning setting, and theoretical and empirical support is largely lacking. Our study aims to formally introduce spatial cross-validation to the machine learning community. We present experiments on data sets from two different domains (mammalian ecology and agriculture), which include globally distributed multi-target data, and show how standard cross-validation may lead to over-optimistic evaluation. We propose how to use tailored spatial cross-validation in this context to achieve a more realistic assessment of performance and more prudent model selection.
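As a rough illustration of the contrast drawn in the abstract, the sketch below groups points into latitude/longitude grid blocks and holds out whole blocks per fold, so that test points are not interleaved with training points the way they are under random splitting. It is not the procedure used in the paper: the function names, the 10-degree block size, and the round-robin fold assignment are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def spatial_block_folds(coords, block_size_deg=10.0, n_folds=5, seed=0):
    """Assign each (lat, lon) point to a fold, keeping every point of a
    grid block in the same fold.

    block_size_deg and n_folds are illustrative defaults, not values
    taken from the paper.
    """
    # Group point indices by the grid cell (block) they fall into.
    blocks = defaultdict(list)
    for idx, (lat, lon) in enumerate(coords):
        key = (int((lat + 90.0) // block_size_deg),
               int((lon + 180.0) // block_size_deg))
        blocks[key].append(idx)

    # Shuffle the blocks reproducibly, then deal them out round-robin.
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    fold_of = [None] * len(coords)
    for i, key in enumerate(keys):
        for idx in blocks[key]:
            fold_of[idx] = i % n_folds
    return fold_of


def train_test_splits(fold_of, n_folds):
    """Yield (train_indices, test_indices), one pair per fold."""
    for k in range(n_folds):
        test = [i for i, f in enumerate(fold_of) if f == k]
        train = [i for i, f in enumerate(fold_of) if f != k]
        yield train, test
```

Equal-degree blocks are not equal-area on a sphere, and points on either side of a block boundary can still be close together, so a real setup would likely need area-aware blocks or a buffer zone between training and test regions.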

Notes

  1. The distance between two georeferenced points can be calculated using the haversine formula, which gives the shortest-path (great-circle) distance on a sphere between two points from their longitude and latitude coordinates [7].

  2. We made the data sets and our code publicly available at https://github.com/ritabei/Spatial-cross-validation.
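The haversine distance mentioned in Note 1 can be sketched in a few lines. This is a minimal standalone version, not code from the authors' repository; the function name and the mean Earth radius constant are assumptions, and the Earth is treated as a perfect sphere.

```python
import math

# Mean Earth radius in kilometres; an assumed constant for this sketch.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

For example, `haversine_km(0, 0, 0, 90)` gives a quarter of the equatorial circumference of the assumed sphere.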

References

  1. Adams, M.D., Massey, F., Chastko, K., Cupini, C.: Spatial modelling of particulate matter air pollution sensor measurements collected by community scientists while cycling, land use regression with spatial cross-validation, and applications of machine learning for data correction. Atmos. Environ. 230, 117479 (2020)

  2. Airola, A., et al.: The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers. Data Min. Knowl. Disc. 33(3), 730–747 (2019)

  3. Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)

  4. Bahn, V., McGill, B.J.: Testing the predictive performance of distribution models. Oikos 122(3), 321–331 (2013)

  5. Batjes, N.: Harmonized soil profile data for applications at global and continental scales: updates to the WISE database. Soil Use Manag. 25(2), 124–127 (2009)

  6. Channan, S., Collins, K., Emanuel, W.: Global mosaics of the standard MODIS land cover type data. University of Maryland and the Pacific Northwest National Laboratory, College Park, Maryland, USA 30 (2014)

  7. Chopde, N.R., Nichat, M.: Landmark based shortest path detection by using A* and Haversine formula. Int. J. Innov. Res. Comput. Commun. Eng. 1(2), 298–302 (2013)

  8. Feluch, W., Koronacki, J.: A note on modified cross-validation in density estimation. Comput. Stat. Data Analysis 13(2), 143–151 (1992)

  9. Galbrun, E., Tang, H., Fortelius, M., Žliobaitė, I.: Computational biomes: The ecometrics of large mammal teeth. Palaeontol. Electron. 21(21.1. 3A), 1–31 (2018)

  10. Getis, A.: A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr. Anal. 40(3), 297–309 (2008)

  11. Hijmans, R.J.: Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology 93(3), 679–688 (2012)

  12. Karasiak, N., Dejoux, J.-F., Monteil, C., Sheeren, D.: Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Mach. Learn. 111, 1–26 (2021). https://doi.org/10.1007/s10994-021-05972-1

  13. Lary, D., et al.: Machine learning applications for earth observation. In: Mathieu, P.-P., Aubrecht, C. (eds.) Earth Observation Open Science and Innovation. ISRS, vol. 15, pp. 165–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-65633-5_8

  14. Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., Bretagnolle, V.: Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Glob. Ecol. Biogeogr. 23(7), 811–820 (2014)

  15. Meyer, H., Pebesma, E.: Machine learning-based global maps of ecological variables and the challenge of assessing them. Nat. Commun. 13(1), 1–4 (2022)

  16. Miller, H.J.: Tobler’s first law and spatial analysis. Ann. Assoc. Am. Geogr. 94(2), 284–289 (2004)

  17. Ploton, P., et al.: Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nat. Commun. 11(1), 1–11 (2020)

  18. Pohjankukka, J., Pahikkala, T., Nevalainen, P., Heikkonen, J.: Estimating the prediction performance of spatial models via spatial k-fold cross validation. Int. J. Geogr. Inf. Sci. 31(10), 2001–2019 (2017)

  19. Roberts, D.R., et al.: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913–929 (2017)

  20. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Model. 406, 109–120 (2019)

  21. Trachsel, M., Telford, R.J.: Estimating unbiased transfer-function performances in spatially structured environments. Climate of the Past 12(5), 1215–1223 (2016)

  22. Valavi, R., Elith, J., Lahoz-Monfort, J.J., Guillera-Arroita, G.: blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol. Evol. 10(2), 225–232 (2019)

  23. Wadoux, A.M.C., Heuvelink, G.B., De Bruin, S., Brus, D.J.: Spatial cross-validation is not the right way to evaluate map accuracy. Ecol. Model. 457, 109692 (2021)

  24. Žliobaitė, I., et al.: Herbivore teeth predict climatic limits in Kenyan ecosystems. Proc. Natl. Acad. Sci. 113(45), 12751–12756 (2016)

Acknowledgements

We thank Tang Hui for the initial pre-processing of the mammalian ecology data set. Research leading to these results was supported by the Academy of Finland (grants no. 314803 and 341623).

Author information

Corresponding author

Correspondence to Rita Beigaitė.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Beigaitė, R., Mechenich, M., Žliobaitė, I. (2022). Spatial Cross-Validation for Globally Distributed Data. In: Pascal, P., Ienco, D. (eds) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_10

  • DOI: https://doi.org/10.1007/978-3-031-18840-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18839-8

  • Online ISBN: 978-3-031-18840-4

  • eBook Packages: Computer Science, Computer Science (R0)
