A comparison of presence-only analytical techniques and their application in forest pest modeling
Introduction
Insect pests are prominent drivers of forest disturbances that alter ecosystem function, structure, and composition across the landscape (Flower and Gonzalez-Meler, 2015; Yang and Gratton, 2014). Their ecological and economic impacts occasionally exceed those of other disturbances (e.g., wildfire and windthrow) in many landscapes (Dale et al., 2001; Müller et al., 2008). Interactions among disturbance agents often amplify impacts beyond those of any single agent, particularly under climate change (Bergeron and Leduc, 1998; Raffa et al., 2008). At low severity and/or infrequent and discrete temporal scales, disturbance agents can promote forest heterogeneity, which provides diverse habitats for other species and enhances succession and energy flow (Gandhi et al., 2022). However, at higher population levels they may inhibit ecosystem services (e.g., carbon sequestration, timber values, and wildlife habitat), at least in the short term, and may facilitate invasion by non-native species (Rosenberger et al., 2012).
Negative effects of insect pests on forest health and sustainability have fueled research assessing pest outbreak risk, pest distribution and range fluctuations, and the probability of invasion or spread of non-native species (Aoki et al., 2018; Bright et al., 2020; Cudmore et al., 2010; Gan, 2004; Lantschner et al., 2014; Munro et al., 2021; Senf et al., 2017; Tribe and Cillié, 2004). However, insect pest data, especially for non-native species, are often collected as presence-only data due to the nature of sampling (i.e., collection devices have limited coverage) and the cryptic nature of most insect pests at various life stages. These data are challenging to model, especially when extrapolating beyond the spatial-temporal boundaries of the data. Such issues, including spatial sampling bias, have been discussed in the ecological species distribution model (SDM) literature (Phillips et al., 2009). Species distribution models fall into two major categories: profile and group discrimination techniques (Wisz and Guisan, 2009). Profile techniques use presence-only data [i.e., ANUCLIM, BIOCLIM, DOMAIN, ecological niche factor analysis (ENFA), HABITAT, and Mahalanobis distance (MD)], but produce coarse predictions and are not as flexible as group discrimination techniques (Busby, 1991; Hirzel et al., 2001; Hirzel et al., 2002; Houlder et al., 2000; Pearce and Boyce, 2006). BIOCLIM is a software package that implements the simplest form of profile techniques. It is widely applied in the literature but is limited in that it does not account for correlations and interactions among predictors, does not handle species discontinuity in space and time, and often underperforms when compared to other profile techniques (limitations and proper application for increased performance are further reviewed in Booth et al., 2014; Booth, 2018; Tsoar et al., 2007).
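To make the idea of a profile technique concrete, the following is a minimal sketch of a BIOCLIM-style rectangular envelope. It is an illustration only and not BIOCLIM's actual implementation: the function names, the 5th and 95th percentile limits, and the two example predictors (temperature and precipitation) are assumptions chosen for the example.

```python
from statistics import quantiles


def fit_envelope(presence_rows, lower_pct=5, upper_pct=95):
    """Return one (low, high) pair per predictor from presence records.

    presence_rows: list of tuples, one tuple of predictor values per record.
    The percentile limits are illustrative assumptions.
    """
    n_vars = len(presence_rows[0])
    limits = []
    for j in range(n_vars):
        values = [row[j] for row in presence_rows]
        # quantiles(..., n=100) returns the 99 percentile cut points
        cuts = quantiles(values, n=100, method="inclusive")
        limits.append((cuts[lower_pct - 1], cuts[upper_pct - 1]))
    return limits


def inside_envelope(site, limits):
    """True when every predictor at the site lies within its own limits.

    Each predictor is checked independently, so correlations and
    interactions among predictors play no role in the result.
    """
    return all(low <= value <= high for value, (low, high) in zip(site, limits))


# Hypothetical presence records: (mean temperature in C, annual precipitation in mm)
presence = [(10 + 0.5 * i, 800 + 20 * i) for i in range(21)]
limits = fit_envelope(presence)
suitable = inside_envelope((15.0, 1000.0), limits)
unsuitable = inside_envelope((15.0, 1300.0), limits)
```

Because each predictor is tested on its own, a site with an implausible combination of values can still be classified as suitable as long as each value is individually within range, which is one way to read the limitation on correlations and interactions noted above.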
Other profile methodologies, such as the Genetic Algorithm for Rule-set Prediction (GARP) and MD, were developed to reduce some of these limitations and have increased model performance (Tsoar et al., 2007). Of the profile methods, GARP has been shown to be the most flexible and accurate (Tsoar et al., 2007); however, Stockman et al. (2006) argue that accuracy may decline for cryptic species. GARP may not perform well in forest pest modeling, given that many species spend a large portion of their life history inside host trees [e.g., bark and woodboring insects (Hain et al., 2011; Ryan and Hurley, 2012; Safranyik and Carroll, 2007)]. Although GARP requires only presence data (true absences are not considered in the model), it is more of an intermediate between profile and group discrimination techniques, as the algorithm resamples the study area to allocate background points.
Group discrimination techniques contrast presences with background points (i.e., pseudo-absences) using traditional modeling techniques (e.g., generalized linear models and generalized additive models) and/or newer machine learning algorithms (e.g., maximum entropy, random forest, and gradient boosted decision trees) (Wisz and Guisan, 2009). Generalized linear models and generalized additive models are the most widely used but require a priori hypotheses on the possible relationships and interactions (Hastie and Tibshirani, 1990; Lütolf et al., 2006; McCullagh and Nelder, 1989; Pearce and Boyce, 2006). Machine learning techniques are relatively new compared to the above-mentioned techniques, and their flexible nature, ease of handling big data, and ability to explore complex interactions make them attractive (Evans et al., 2011). Further, they do not require a priori hypotheses on the relationship between variables as is required by traditional modeling approaches, which allows for exploration of non-intuitive relationships (Evans et al., 2011). Background points can be selected: 1) randomly, without replacement, from the study area or known species range without considering presence locations in the data; 2) randomly with the exclusion of presence locations; 3) using geographic weighting and/or exclusion of locations that are environmentally similar to presences; 4) by selecting locations where similar species were recorded but the target species was not; or 5) through a combination of these approaches (Elith and Leathwick, 2007; Hirzel et al., 2001; Senay et al., 2013; Stockwell, 1999; VanDerWal et al., 2009; Zaniewski et al., 2002). Both the method used to select background points and the number of background points influence model performance and prediction, particularly when extrapolating beyond the data either spatially or temporally (VanDerWal et al., 2009).
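The first two background-selection strategies listed above can be sketched as follows. This is a simplified illustration, not the procedure used in any of the cited studies: the function name, the representation of the study area as a list of grid-cell identifiers, and the example grid are all assumptions.

```python
import random


def sample_background(cells, presences, n_points, exclude_presences=True, seed=None):
    """Draw background cells at random without replacement.

    cells: hashable identifiers of all cells in the study area (hypothetical grid).
    presences: cells where the species was recorded.
    exclude_presences: if True, presence cells are never drawn (strategy 2);
        if False, presence locations are ignored when drawing (strategy 1).
    """
    rng = random.Random(seed)
    presence_set = set(presences)
    candidates = [
        cell for cell in cells
        if not (exclude_presences and cell in presence_set)
    ]
    if n_points > len(candidates):
        raise ValueError("more background points requested than candidate cells")
    # random.sample draws without replacement
    return rng.sample(candidates, n_points)


# Hypothetical 10 x 10 study area with five presence cells
grid = [(x, y) for x in range(10) for y in range(10)]
presence_cells = [(1, 1), (2, 5), (4, 4), (7, 3), (9, 9)]
background = sample_background(grid, presence_cells, n_points=20, seed=1)
```

Calling the function with several values of `n_points` is one simple way to generate data sets for examining how the quantity of background points affects a model, and swapping `exclude_presences` switches between the two random strategies.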
Species distribution models used in forest health are crucial for determining suitable habitat for pests, as this allows limited detection and management resources to be used more efficiently. Such models have primarily relied on maximum entropy (e.g., MaxEnt) and Genetic Algorithm for Rule-set Prediction (GARP) approaches, with varying success (Barredo et al., 2015; Fischbein et al., 2019; Sobek-Swant et al., 2012; Srivastava et al., 2020; Venette, 2017). Data used in these models are often obtained through historical and museum records, field collections using insect traps baited with attractive lure combinations (e.g., host odors and/or insect pheromone components), eradication effort data (proxy for establishment), satellite data, or a combination of these sources (Dang et al., 2021; Epanchin-Niell et al., 2021; Munro et al., 2021; Ning et al., 2021; Sarikaya and Sen, 2020). Few studies have compared the different types of species distribution models (i.e., traditional modeling techniques and newer machine learning algorithms) used for cryptic forest insect species, and fewer still have thoroughly assessed more than two modeling techniques based on predictive accuracy. Life history has been shown to be an important factor for other cryptic wildlife (Aubry et al., 2017) and, in general, forest insects are even more elusive than the species previously tested because many spend a portion of their life cycle inside host trees (e.g., bark and woodboring insects) (Fettig and Audley, 2021; Munro et al., 2019).
The goal of this study was, therefore, to assess group discrimination modeling techniques in relation to forest insect pest prediction models. To do so, we used Sirex woodwasp (Sirex noctilio Fabricius) (Hymenoptera: Siricidae) presence-absence data collected in Chile. Sirex woodwasp is not a major pest throughout its native range in Asia, Europe, and northern Africa, but causes significant damage to pine trees (Pinus spp.) in newly invaded areas, such as Australia, New Zealand, and South America (i.e., Argentina, Brazil, Chile, and Uruguay) (Ayres et al., 2014; Ciesla, 2003). Specifically, our objectives were to: 1) compare traditional modeling techniques used in species distribution models (i.e., generalized linear models and generalized additive models) with newer machine learning algorithms (i.e., MaxEnt, random forest, gradient boosted decision trees, and extreme gradient boosting); 2) compare the model accuracy of true presence-absence and presence-background point data; 3) assess model performance with varying background point numbers; and 4) compare prediction maps and uncertainty of Sirex woodwasp using the aforementioned techniques with regard to prediction accuracy (one model for the true presence-absence data and one for the presence-background point data). Prior research assessing habitat suitability generally discusses model uncertainty, but uncertainty maps are rarely presented even though uncertainty varies in space and time. Gaining knowledge on the spatial-temporal upper and lower bounds of predictions would greatly enhance our understanding of habitat suitability. Results from this study will help guide future research assessing forest suitability for insect pests, which is especially beneficial for modeling potential distribution and establishment of non-native species.
Data collection
Sirex woodwasp data were collected from Monterey pine (Pinus radiata D. Don) plantations in Chile during April and May of 2013. Data included 2405 unique 250 m² plots (Supplementary file 1) established in 432 plantation stands within 101 farms in the following geographic limits: 667,580 (minimum X), 789,165 (maximum X), 5,770,425 (minimum Y), and 5,959,550 (maximum Y) (EPSG:32718). All trees inside of the plots were characterized for diameter, height, and the presence or absence of S. noctilio
True presence-absence model comparisons
Our study investigated the performance of six different species distribution modeling techniques. These analyses are particularly important for assessing native pest outbreaks and invasive species distributions, two phenomena that have increased in recent decades and have caused extensive tree mortality (Kurz et al., 2008; Srivastava et al., 2020; Srivastava et al., 2021). These impacts are projected to continue increasing due to future global climate change (Kurz et al., 2008). In this work,
Concluding remarks
We identify three key gaps in knowledge that may be explored within forest ecological modeling. Firstly, sampling bias that is introduced due to presence-only modeling can lead to reduced model accuracy (Ranc et al., 2017), and we found that the inclusion of background points led to more conservative results. These effects may be further amplified by the cryptic nature of forest insect pests. While sampling bias was not specifically explored in this study, model performance for some models was
Funding
This work was supported by the Plantation Management Research Cooperative (PMRC) and Warnell School of Forestry and Natural Resources (University of Georgia).
Declaration of Competing Interest
None.
Acknowledgements
We would like to thank Bioforest, the research branch from ARAUCO company (Chile, South America) for providing Sirex woodwasp data. We also thank the members of the PMRC and Gandhi Forest Entomology Laboratory (University of Georgia) for project feedback.
References (111)
Old pests in new places: effects of stand structure and forest type on susceptibility to a bark beetle on the edge of its native range
For. Ecol. Manag. (2018)
Species distribution modelling tools and databases to assist managing forests under climate change
For. Ecol. Manag. (2018)
Predicting failure in the U.S. banking sector: an extreme gradient boosting approach
Int. Rev. Econ. Financ. (2019)
Conifer bark beetles
Curr. Biol. (2021)
Modelling the distribution of forest pest natural enemies across invaded areas: towards understanding the influence of climate on parasitoid establishment success
Biol. Control (2019)
A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa
Ecol. Model. (2008)
Risk and damage of southern pine beetle outbreaks under global climate change
For. Ecol. Manag. (2004)
Variable selection using random forests
Pattern Recogn. Lett. (2010)
Predictive habitat distribution models in ecology
Ecol. Model. (2000)
Assessing habitat-suitability models with a virtual species
Ecol. Model. (2001)
Exploring the effects of the design and quantity of absence data on the performance of random forest-based landslide susceptibility mapping
Catena
Splines - more than just a smooth interpolator
Geoderma
Impact of model complexity on cross-temporal transferability in Maxent species distribution models: an assessment using paleobotanical data
Ecol. Model.
Through space and time: predicting numbers of an eruptive pine tree pest and its predator under changing climate conditions
For. Ecol. Manag.
Hazard ratings of pine forests to a pine wilt disease at two spatial scales (individual trees and stands) using self-organizing map and random forest
Ecol. Inform.
Potential distribution of emerald ash borer: what can we learn from ecological niche models using Maxent and GARP?
For. Ecol. Manag.
repr: Serializable representations. R package version 1.1.3
The importance of data quality for generating reliable distribution models for rare, elusive, and cryptic species
PLoS One
Host use patterns by the European woodwasp, Sirex noctilio, in its native and invaded range
PLoS One
Selecting pseudo-absences for species distribution models: how, where and how many?
Methods Ecol. Evol.
Assessing the potential distribution of insect pests: case studies on large pine weevil (Hylobius abietis L) and horse-chestnut leaf miner (Cameraria ohridella) under present and future climate conditions in European forests
EPPO Bull.
Relationships between change in fire frequency and mortality due to spruce budworm outbreak in the southeastern Canadian boreal forest
J. Veg. Sci.
rgdal: Bindings for the “Geospatial” data abstraction library. R package version 1.4–8
BIOCLIM: the first species distribution modelling package, its early applications and relevance to most current MAXENT studies
Divers. Distrib.
Model-specification uncertainty in future forest pest outbreak
Glob. Chang. Biol.
Random forests
Mach. Learn.
Mapping multiple insect outbreaks across large regions annually using Landsat time series data
Remote Sens.
BIOCLIM-A bioclimate analysis and prediction system
Plant Protect. Quart.
Xgboost: extreme gradient boosting. R package version 0.4–2
European woodwasp: a potential threat to North America’s conifer forests
J. For.
Climate change and range expansion of an aggressive bark beetle: evidence of higher beetle reproduction in naïve host tree populations
J. Appl. Ecol.
Random forests for classification in ecology
Ecology
Climate change and forest disturbances: climate change can affect forests by altering the frequency, intensity, duration, and timing of fire, drought, introduced species, insect and pathogen outbreaks, hurricanes, windstorms, ice storms, or landslides
BioScience
Retrospective analysis of factors affecting the distribution of an invasive wood-boring insect using native range data: the importance of host plants
J. Pest. Sci.
The impact of Sirex noctilio in Pinus resinosa and Pinus sylvestris stands in New York and Ontario
Can. J. For. Res.
Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines
Divers. Distrib.
Novel methods improve prediction of species’ distributions from occurrence data
Ecography
A statistical explanation of MaxEnt for ecologists
Divers. Distrib.
Socio-environmental drivers of establishment of Lymantria dispar, a nonnative forest pest, in the United States
Biol. Invasions
Modeling species distribution and change using random forest
An evaluation of the effectiveness of environmental surrogates and modelling techniques in predicting the distribution of biological diversity
WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas
Int. J. Climatol.
Responses of temperate forest productivity to insect and pathogen disturbances
Annu. Rev. Plant Biol.
Greedy function approximation: a gradient boosting machine
Ann. Stat.
Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)
Ann. Stat.
Regularization paths for generalized linear models via coordinate descent
J. Stat. Softw.
Bark beetle outbreaks alter biotic components of forested landscapes
Natural history of the southern pine beetle
Generalized Additive Models
The Elements of Statistical Learning: Data Mining, Inference, and Prediction