Use of geospatial methods to characterize dispersion of the Emerald ash borer in southern Ontario, Canada
Introduction
In the initial stages of the emerald ash borer, Agrilus planipennis (EAB), outbreak in Canada, the cryptic nature of the beetle allowed it to remain undetected for about 10 years prior to its discovery in 2002 (de Groot et al., 2006; Fairmaire and Parsons, 2008). The EAB is an extremely aggressive pest, as it can kill ash trees in as few as two growing seasons (Gaetz and Hildebrand, 2012). To eradicate the beetle, efforts in the United States have focused mainly on extensive tree-removal strategies, while in Canada visual surveys and selective branch sampling of the trees have been the dominant strategies (Marchant, 2012). However, visual surveys are not the most reliable means of identifying species presence, as ash trees do not exhibit visible symptoms until a year or more after their initial infestation (BenDor et al., 2006; Pontius et al., 2008). For this reason, early detection of the EAB is imperative, and species distribution models (SDMs) offer a means to identify relevant predictor factors and draw attention to high-risk areas based on the model predictions.
Since the main goal of an SDM is to predict the suitability of a landscape for a species, the scale chosen for modelling species-environment interactions should reflect the scale at which the environmental drivers affect population distribution (Cushman and Huettmann, 2010). However, the optimal spatial scale for representing the relationship between the species data and its surroundings may be unknown to researchers, and it is constrained by the scale of the species data and of the available predictor variables. Traditionally, SDMs have mainly used climatic variables for predicting invasive species distributions, as these are able to reveal the habitat preferences of species across relatively large geographic areas. However, the incorporation of fine-scale anthropogenic variables such as distance to the nearest highway can provide insight into the long-distance propagation of invasive species (Gallardo et al., 2015) and potentially improve the predictive power of SDMs. The importance of including anthropogenic variables was addressed in an earlier paper on the EAB by BenDor et al. (2006), in which a system dynamics model was used to assess various hypothetical anthropogenic influences on EAB presence and spread. Similarly, Prasad et al. (2010) included various human-influenced variables in their sub-model known as the insect ride model (IRM), and Huset (2013) combined climatic and anthropogenic variables in logistic regression and Maxent models. Although these studies provide detailed modelling procedures, the effects of sampling bias on the EAB dataset were not addressed.
Sampling bias is expected during the sampling of any species, especially if samples are taken only in accessible areas such as along highways and near population centres (Marchant, 2012). This poses a challenge for SDMs that attempt to capture the true extent of a species and effectively monitor its spread. From a modelling perspective, sampling bias in the species dataset can artificially increase the spatial autocorrelation of the species clusters (Boria et al., 2014), resulting in an overfitted model that lacks the ability to predict an independent dataset (Hijmans, 2012; Veloz, 2009). A simple method to reduce spatial autocorrelation in a species dataset prior to modelling is to filter the localities by a specified distance to produce spatially independent points (Boria et al., 2014; Veloz, 2009). A second issue associated with manual sampling is the collection of a much greater proportion of absence records than presence records, which affects model calibration and accuracy (Fukuda and De Baets, 2016). The appropriate prevalence (the proportion of presence records in the dataset) required to maximize model performance depends on the dynamics between the species and the predictor variables and should be determined empirically (Santika, 2011). For instance, McPherson et al. (2004) and Barbet-Massin et al. (2012) concluded that a prevalence of 0.5 achieved the greatest accuracy in their predictive models, whereas Franklin et al. (2009) obtained high accuracy for rare species modelled with a low prevalence.
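The two pre-processing ideas described above, distance-based filtering of localities and adjustment of prevalence, can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names `thin_by_distance` and `subsample_to_prevalence`, the greedy keep-first rule, and the example coordinates are assumptions made for illustration only. Coordinates are assumed to be projected (for example, in metres) so that Euclidean distance is meaningful.

```python
import numpy as np

def thin_by_distance(xy, min_dist):
    """Greedy spatial filter (hypothetical helper).

    Keeps a point only if it lies at least `min_dist` from every point
    already kept. `xy` is an (n, 2) array of projected coordinates in the
    same unit as `min_dist`. The result depends on the input order.
    """
    xy = np.asarray(xy, dtype=float)
    kept = []
    for i, p in enumerate(xy):
        if all(np.hypot(*(p - xy[j])) >= min_dist for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

def subsample_to_prevalence(labels, prevalence, seed=0):
    """Randomly drop absences (0) so that presences (1) make up roughly
    the requested fraction of the returned records (hypothetical helper)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    presences = np.flatnonzero(labels == 1)
    absences = np.flatnonzero(labels == 0)
    n_abs = int(round(len(presences) * (1 - prevalence) / prevalence))
    n_abs = min(n_abs, len(absences))  # cannot keep more than exist
    keep_abs = rng.choice(absences, size=n_abs, replace=False)
    return np.sort(np.concatenate([presences, keep_abs]))

# Illustrative data only: five localities along a line, filtered at 500 m.
points = np.array([[0, 0], [100, 0], [1000, 0], [1050, 0], [3000, 0]])
print(thin_by_distance(points, 500))  # indices 0, 2 and 4 survive

labels = np.array([1, 1, 0, 0, 0, 0, 0, 0])
idx = subsample_to_prevalence(labels, prevalence=0.5)
print(labels[idx].mean())  # prevalence of the retained records
```

Because the greedy filter keeps the first point it encounters in each cluster, shuffling the records before filtering, or repeating the filter over several random orders, is one way to avoid favouring a particular locality.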
Three modelling approaches appropriate for binomial data were employed: logistic regression, Random Forest (RF) and Random Generalized Linear Model (RGLM). Given the complex nature of habitat selection for a species, the relationships between variables may be nonlinear or scale-dependent (Drew et al., 2011) and may not be adequately captured by logistic regression; the non-parametric RF model was therefore also explored. The third model, RGLM, is a hybrid of the two that harnesses the interpretability of logistic regression and the high predictive capability of RF. Ultimately, the output of a species distribution model is a predictive risk map, which gives conservation authorities a visual aid for assessing the potential distribution of a species. Although it is straightforward to create risk maps using the coefficients of a statistical model such as logistic regression as weights, the same cannot be said of machine learning methods, whose results are more difficult to display in a geographic information systems (GIS) platform. To help overcome this constraint, this study aimed to develop a workflow to mitigate the effects of sampling bias in species data, reduce multicollinearity among predictor variables and characterize the spread of the EAB in Southern Ontario using logistic regression, RF and RGLM (Song et al., 2013). Lastly, an automated risk map creation tool was designed to streamline the process of obtaining probabilities from machine learning models and to allow them to be visualized in map form.
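To make concrete what using logistic regression coefficients as weights involves, the sketch below computes a per-cell probability surface from a stack of predictor rasters. It is a generic illustration and not the automated tool developed in this study; the function name `logistic_risk_map`, the array layout and the coefficient values are all assumptions.

```python
import numpy as np

def logistic_risk_map(rasters, coefs, intercept):
    """Weighted overlay of predictor rasters passed through the logistic link.

    rasters   : array of shape (k, rows, cols), one layer per predictor
    coefs     : array of shape (k,), fitted logistic regression coefficients
    intercept : scalar intercept of the fitted model
    Returns an array of shape (rows, cols) with probabilities in (0, 1).
    """
    rasters = np.asarray(rasters, dtype=float)
    coefs = np.asarray(coefs, dtype=float)
    # Linear predictor per cell: intercept + sum_i coefs[i] * rasters[i]
    linear = intercept + np.tensordot(coefs, rasters, axes=1)
    return 1.0 / (1.0 + np.exp(-linear))

# Illustrative values only: two predictor layers on a 3 x 4 grid.
rng = np.random.default_rng(0)
layers = rng.normal(size=(2, 3, 4))
risk = logistic_risk_map(layers, coefs=[1.5, -0.7], intercept=-0.2)
print(risk.shape)
```

A fitted RF or RGLM has no such closed-form weighting, so each raster cell must instead be passed through the model's prediction function, which is the step a map-generation tool would need to automate.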
Methodology
The species data were pre-processed to reduce the effects of spatial autocorrelation and the predictor variables were assessed for multicollinearity prior to their inclusion into three species distribution models. Although a comparison of the performance of the models is important for future distribution modelling of the EAB, ultimately predictive risk maps were created which conservation and management authorities can utilize to identify at-risk areas. The processes involved in developing the
Multicollinearity of the predictor variables
The correlation matrix indicated that three variables (June solar radiation, camps, and forest processing facilities) exhibited strong and statistically significant correlations with two or more other variables. While pair-wise correlations of the predictor variables are informative, a more robust way of quantifying the degree of multicollinearity of each variable as a single value is the variance inflation factor (VIF), which uses a least-squares regression for each variable as a function of the
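As a general illustration of the VIF computation (not the authors' code), each predictor can be regressed on all of the others by ordinary least squares, with VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination of that regression. The function name `variance_inflation_factors` and the synthetic data below are assumptions.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (n samples by k predictors).

    Column j is regressed on the remaining columns plus an intercept,
    and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return vifs

# Synthetic example: column c is nearly a copy of column a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
print(variance_inflation_factors(np.column_stack([a, b, c])))
```

In this synthetic case the two nearly identical columns receive very large VIFs while the independent column stays close to 1. The threshold above which a variable is dropped is a modelling choice and is not implied by this sketch.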
Conclusions
The workflow presented in this research provides a guideline for pre-processing EAB species data and addressing the effects of spatial autocorrelation and prevalence that arise from sampling bias. Model comparisons with regard to the transferability of logistic regression, RF, and RGLM, using the classification accuracies and AUC values, indicate superior performance on the validation dataset but poor performance on the spatio-temporally varying prediction dataset. Correspondingly, the high
Declaration of Competing Interest
The authors have no conflicts of interest.
Acknowledgements
We would like to sincerely thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and Esri Canada for their financial support. In addition, Haydn Lawrence deserves special recognition for his contribution to the creation of the automated risk map generation tool. The staff at the Canadian Food Inspection Agency (CFIA), Mireille Marcotte and Cameron Duff deserve acknowledgement for their steadfast preparation of the EAB species data. We would also like to thank the
References (65)
- et al. Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. J. Clin. Epidemiol. (2004)
- et al. Modeling the spread of the emerald ash borer. Ecol. Model. (2006)
- et al. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecol. Model. (2014)
- et al. A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Model. (2008)
- et al. Data prevalence matters when assessing species’ responses using data-driven species distribution models. Ecol. Informa. (2016)
- et al. Comparing generalized linear models and random forest to model vascular plant species richness using LiDAR data in a natural forest in Central Chile. Remote Sens. Environ. (2016)
- et al. Ash decline assessment in emerald ash borer-infested regions: a test of tree-level, hyperspectral technologies. Remote Sens. Environ. (2008)
- et al. White ash (Fraxinus americana) decline and mortality: the role of site nutrition and stress history. For. Ecol. Manag. (2012)
- et al. Model selection in logistic regression and performance of its predictive ability. Comput. Stat. (2010)
- et al. Variance inflation factor: as a condition for the inclusion of suppressor variable(s) in regression analysis. Open J. Stat. (2015)