Elsevier

Ecological Informatics

Volume 55, January 2020, 101037

Use of geospatial methods to characterize dispersion of the Emerald ash borer in southern Ontario, Canada

https://doi.org/10.1016/j.ecoinf.2019.101037

Highlights

  • The hybrid model known as RGLM had better transferability than stand-alone statistical or machine learning models

  • Random Forest is insensitive to multicollinear predictor variables

  • The emerald ash borer (EAB) thrives in areas with stagnant winds and in close proximity to medium- and large-sized population centres

  • Using a 1:1 ratio of species presence to absence points achieved the most robust distribution models

Abstract

Since the introduction of the Asian Emerald Ash Borer beetle (EAB, Agrilus planipennis) to Southern Ontario in 2002, all species of ash trees (Fraxinus) in the province have been at risk. Due to the aggressive nature of this beetle, early detection is critical to its eradication and can be facilitated by species distribution modelling. That said, several issues need to be addressed in order to increase predictive accuracy. In this study, the effects of sampling bias such as positive spatial autocorrelation and data prevalence (i.e., the proportion of presence to absence points) were investigated in an EAB dataset. A filtering distance threshold approximating the EAB's dispersal range was used to minimize the effects of autocorrelation, and the most appropriate prevalence was determined during the modelling process. To analyze the impact of environmental and anthropogenic predictors on the distribution of the EAB, logistic regression, Random Forest (RF) and a hybrid of Random Forest and generalized linear models known as the Random Generalized Linear Model (RGLM) were applied to EAB data from 2006 to 2012 across Ontario. Approximately 80% of the EAB samples were used for training and 20% for validation. Ultimately, three risk maps were created from the 2006–2012 EAB data, using the coefficients from logistic regression as weights and an automated risk map tool for the RF and RGLM models. High-risk areas were identified from the risk maps for species prevalence and distribution monitoring. From these, precautionary measures can be implemented to stem the expansion of the beetle and thus reduce the destruction of ash tree species. All models identified June wind speed as the most important predictor variable, followed by population centres. Lastly, Random Forest had the best sensitivity (86%), followed by stepwise backward logistic regression (82%) and RGLM (77%), for the 2013 prediction dataset.

Introduction

In the initial stages of the Agrilus planipennis (EAB) outbreak in Canada, the stealthy nature of the beetle helped it stay undetected for about 10 years prior to its discovery in 2002 (de Groot et al., 2006; Fairmaire and Parsons, 2008). The EAB is an extremely aggressive pest, as it can obliterate ash trees in as few as two growing seasons (Gaetz and Hildebrand, 2012). In order to eradicate the beetle, efforts in the United States have focused mainly on extensive tree-removal strategies, while in Canada visual surveys and selective branch sampling of the trees have been the dominant strategies (Marchant, 2012). However, visual surveys are not the most reliable means of identifying species presence, as ash trees do not exhibit visible symptoms until a year after their initial infestation (BenDor et al., 2006; Pontius et al., 2008). For this reason, the early detection of the EAB is imperative, and species distribution models (SDMs) offer a means to identify relevant predictor factors and draw attention to high-risk areas based on the model predictions.

Since the main goal of an SDM is to predict the suitability of a landscape for a species, the chosen scale for modelling species-environment interactions should reflect the scale at which the environmental drivers affect population distribution (Cushman and Huettmann, 2010). However, the optimal spatial scale that represents the relationship between the given species data and its surroundings may be unknown to researchers and is constrained by the scale of the species data and the available predictor variables. Traditionally, SDMs have mainly used climatic variables for predicting invasive species distributions, as they are able to reveal habitat preferences of species across relatively large geographic areas. However, the incorporation of fine-scale anthropogenic variables such as distance to the nearest highway can provide insight into the long-distance propagation of invasive species (Gallardo et al., 2015) and potentially improve the predictive power of SDMs. The importance of including anthropogenic variables was addressed in an earlier paper on the EAB by BenDor et al. (2006), where a system dynamics model was used to assess the implementation of various hypothetical anthropogenic influences on EAB presence and spread. Similarly, Prasad et al. (2010) included various human-influenced variables in their sub-model known as the insect ride model (IRM), and Huset (2013) combined climatic and anthropogenic variables for inclusion in logistic regression and Maxent models. Although these studies provide detailed modelling procedures, the effects of sampling bias on the EAB dataset were not addressed.

Sampling bias is expected during the sampling of any species, especially if the samples are only taken near areas such as along highways and population centres (Marchant, 2012). This poses a challenge for SDMs that attempt to capture the true extent of a species and effectively monitor its spread. From a modelling perspective, sampling bias in the species dataset can artificially inflate the spatial autocorrelation of the species clusters (Boria et al., 2014), resulting in an overfitted model that lacks the ability to predict an independent dataset (Hijmans, 2012; Veloz, 2009). A simple method to reduce spatial autocorrelation in a species dataset prior to modelling is to filter the localities by a specified distance to produce spatially independent points (Boria et al., 2014; Veloz, 2009). The second issue associated with manual sampling is the collection of a disproportionately large number of absence records relative to presence records, which affects model calibration and accuracy (Fukuda and De Baets, 2016). The appropriate prevalence required to maximize model performance depends on the dynamics between the species and the predictor variables and should be determined empirically (Santika, 2011). For instance, the predictive models of McPherson et al. (2004) and Barbet-Massin et al. (2012) achieved the greatest accuracy with a prevalence of 0.5, whereas Franklin et al. (2009) obtained high accuracy for rare species modelled with a low prevalence.
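The two pre-processing steps described above, distance filtering and prevalence adjustment, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the greedy filtering order, the 1000 m threshold and the toy coordinates are all assumptions made for illustration, and coordinates are assumed to be in a projected system with units of metres.

```python
import numpy as np

def thin_by_distance(xy, min_dist):
    """Greedy spatial filter (hypothetical helper): keep a point only if it
    lies at least min_dist from every point that has already been kept."""
    xy = np.asarray(xy, dtype=float)
    kept = []
    for i, p in enumerate(xy):
        if all(np.hypot(*(p - xy[j])) >= min_dist for j in kept):
            kept.append(i)
    return kept

def balance_prevalence(n_presence, absence_idx, prevalence=0.5, seed=0):
    """Randomly subsample absences so that
    presences / (presences + absences) is approximately `prevalence`."""
    rng = np.random.default_rng(seed)
    n_abs = int(round(n_presence * (1 - prevalence) / prevalence))
    n_abs = min(n_abs, len(absence_idx))  # cannot draw more than exist
    chosen = rng.choice(absence_idx, size=n_abs, replace=False)
    return sorted(chosen.tolist())

# Toy presence localities (metres); the 1000 m threshold is illustrative only.
presence_xy = [(0, 0), (100, 0), (5000, 0), (5200, 0), (12000, 0)]
kept = thin_by_distance(presence_xy, min_dist=1000.0)

# Draw as many absences as retained presences (prevalence 0.5, a 1:1 ratio).
absences = balance_prevalence(len(kept), list(range(10)), prevalence=0.5)
```

The greedy filter depends on the order of the input points, so shuffling the localities or repeating the filter several times are reasonable variations when the retained sample size matters.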

Three modelling approaches appropriate for binomial data were employed: logistic regression, Random Forest and Random Generalized Linear Model (RGLM). Given the complex nature of habitat selection for a species, the relationship between variables may be nonlinear or scale-dependent (Drew et al., 2011), which may not be adequately captured by logistic regression; the non-parametric Random Forest model was therefore also explored. The last model was a hybrid of the two aforementioned models, which harnesses the interpretability of logistic regression and the high predictive capability of Random Forest. On the whole, the ultimate output of a species distribution model is a predictive risk map, which can provide conservation authorities with a visual aid to the potential distribution of a species. Although it is straightforward to create risk maps using the coefficients of a statistical model such as logistic regression as weights, the same cannot be said for machine learning methods, whose results are more difficult to display in a geographic information system (GIS) platform. To help overcome this constraint, this study aimed to develop a workflow to mitigate the effects of sampling bias in species data, reduce multicollinearity among predictor variables and characterize the spread of the EAB in Southern Ontario using logistic regression, Random Forest (RF) and Random Generalized Linear Model (RGLM) (Song et al., 2013). Lastly, an automated risk map creation tool was designed, which streamlined the process of obtaining probabilities from machine learning models and allowed them to be visualized in map form.
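As an illustration of using logistic regression coefficients as weights, the sketch below combines predictor rasters into a probability surface with the standard logistic transform. It is a minimal example under stated assumptions, not the study's workflow: the raster names, the coefficient values and the intercept are invented for demonstration and are not taken from the fitted models.

```python
import numpy as np

def logistic_risk(rasters, coefs, intercept):
    """Weighted overlay: sum coefficient * raster for each predictor, add the
    intercept, and map the linear predictor to a probability in (0, 1)."""
    first = next(iter(rasters.values()))
    z = np.full(first.shape, intercept, dtype=float)
    for name, beta in coefs.items():
        z += beta * rasters[name]
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2x2 predictor grids (values are made up for illustration).
rasters = {
    "june_wind": np.array([[1.0, 2.0], [3.0, 4.0]]),         # wind speed
    "dist_pop_centre": np.array([[1.0, 5.0], [10.0, 20.0]]),  # distance, km
}
# Invented coefficients: negative signs mean risk falls as wind speed and
# distance from population centres increase.
coefs = {"june_wind": -0.5, "dist_pop_centre": -0.1}
risk = logistic_risk(rasters, coefs, intercept=2.0)
```

In a GIS setting the same arithmetic would be applied cell by cell to aligned raster layers, which is why coefficient-based maps are simple to produce compared with predictions from tree ensembles.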

Section snippets

Methodology

The species data were pre-processed to reduce the effects of spatial autocorrelation and the predictor variables were assessed for multicollinearity prior to their inclusion into three species distribution models. Although a comparison of the performance of the models is important for future distribution modelling of the EAB, ultimately predictive risk maps were created which conservation and management authorities can utilize to identify at-risk areas. The processes involved in developing the

Multicollinearity of the predictor variables

The correlation matrix indicated that three variables (June solar radiation, camps, and forest processing facilities) exhibited strong and statistically significant correlations with two or more other variables. While pair-wise correlations of the predictor variables are informative, a more robust way of quantifying the degree of multicollinearity of each variable as a single value is the variance inflation factor (VIF), which uses a least-squares regression for each variable as a function of the
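Because the snippet above is truncated, the exact formulation used in the study is not shown here. The sketch below assumes the standard definition, in which each predictor is regressed on all the others by ordinary least squares and VIF_j = 1 / (1 - R_j^2). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X, assuming the standard
    definition VIF_j = 1 / (1 - R_j^2) from an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
```

In this toy example the collinear pair (x1, x3) receives large VIFs while the independent predictor stays close to 1; the cut-off used to drop variables is a modelling choice and is not specified in the snippet.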

Conclusions

The workflow presented in this research provides a guideline on pre-processing EAB species data and addressing the effects of spatial autocorrelation and prevalence induced by sampling bias. Model comparisons with regard to the transferability of logistic regression, RF, and RGLM using the classification accuracies and AUC values indicate superior performance on the validation dataset but poor performance on the spatio-temporally varying prediction dataset. Correspondingly, the high

Declaration of Competing Interest

The authors have no conflicts of interest.

Acknowledgements

We would like to sincerely thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and Esri Canada for their financial support. In addition, Haydn Lawrence deserves special recognition for his contribution to the creation of the automated risk map generation tool. The staff at the Canadian Food Inspection Agency (CFIA), Mireille Marcotte and Cameron Duff deserve acknowledgement for their steadfast preparation of the EAB species data. We would also like to thank the

References (65)

  • E. Appleton et al.

    Surveillance guidelines for emerald ash borer

    Canad. Food. Inspect. Agency

    (2017)
  • M. Barbet-Massin et al.

    Selecting pseudo-absences for species distribution models: how, where and how many?

    Methods Ecol. Evol.

    (2012)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • J.L. Brown

    SDMtoolbox: a python-based GIS toolkit for landscape genetic, biogeographic and species distribution model analyses

    Methods Ecol. Evol.

    (2014)
  • S.A. Cushman et al.

    Spatial complexity, informatics, and wildlife conservation

  • B.F. Darst et al.

    Using recursive feature elimination in random forest to account for correlated variables in high dimensional data

    BMC Genet.

    (2018)
  • P. de Groot et al.

    A visual guide to detecting emerald ash borer damage

  • C.F. Dormann et al.

    Collinearity: A review of methods to deal with it and a simulation study evaluating their performance

    Ecography

    (2013)
  • C.A. Drew et al.

    Predictive species and habitat modeling in landscape ecology: Concepts and applications

  • J. Elith et al.

    Species distribution models: ecological explanation and prediction across space and time

    Annu. Rev. Ecol. Evol. Syst.

    (2009)
  • Emerald Ash Borer

    United States Department of Agriculture

    (2009)
  • Emerald Ash Borer

    Invasive Species Centre

    (2012)
  • A. Fairmaire et al.

    Emerald Ash Borer

    (2008)
  • S.E. Fick et al.

    Worldclim 2: new 1-km spatial resolution climate surfaces for global land areas

    Int. J. Climatol.

    (2017)
  • D. Firth

    Bias reduction of maximum likelihood

    Biometrika.

    (1993)
  • J.B. Fisher et al.

    An analysis of spatial clustering and implications for wildlife management: a burrowing owl example

    Environ. Manag.

    (2007)
  • J. Franklin et al.

    Effect of species rarity on the accuracy of species distribution models for reptiles and amphibians in southern California

    Divers. Distrib.

    (2009)
  • N. Gaetz et al.

    Recommended Approach for the Management of Emerald Ash Borer

  • B. Gallardo et al.

    The importance of the human footprint in shaping the global distribution of terrestrial, freshwater and marine invaders

    PLoS One

    (2015)
  • R.F. Graf et al.

    The importance of spatial scale in habitat models: capercaillie in the Swiss Alps

    Landsc. Ecol.

    (2005)
  • B. Gregorutti et al.

    Correlation and variable importance in random forests

    Stat. Comput.

    (2017)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machines

    Mach. Learn.

    (2002)