A comparison of presence-only analytical techniques and their application in forest pest modeling

https://doi.org/10.1016/j.ecoinf.2021.101525

Highlights

  • Traditional species habitat suitability models were compared to machine learning.

  • Extreme gradient boosting outperformed all other techniques.

  • In comparison, generalized linear, random forest, and MAXENT models underperformed.

  • Model performance improved as background point quantities increased.

  • Extreme gradient boosting is proposed for complex presence-only data.

Abstract

Insect pests are natural disturbance agents that can significantly alter the structure and composition of forested landscapes, and thus impact their ability to provide critical ecosystem services. Predicting population levels of pest species has become crucial for the management of healthy forests, and species distribution modeling techniques may assist with predictions. Due to the nature of sampling in pest assessments there is often a lack of absence data which requires practitioners to rely on presence-only information. Modeling approaches have been developed for presence-only data but have not been tested for pest species that have major impacts on forest ecosystems. Our research objectives were to compare species distribution models for traditional techniques (i.e., generalized linear and additive models) and contemporary machine learning algorithms (i.e., maximum entropy, random forest, gradient boosted decision trees, and extreme gradient boosting), as well as assess how varying background points influence model performance. True presence-absence data and presences combined with background point data at one, two, three, and ten times the number of presences were compared. Comparisons were done using a comprehensive dataset from 2405 survey plots that assessed the presence and absence of non-native Sirex woodwasp (Sirex noctilio Fabricius) collected in pine plantations in Chile. Contemporary machine learning techniques (>84% average accuracy) outperformed traditional modeling techniques (<82% average accuracy) when utilizing true presence-absence data. For presence-background point models, accuracy tended to increase as the number of background points increased, except for generalized additive models and MaxEnt which had relatively similar performances. Generalized linear models, MaxEnt, and random forest substantially underperformed as compared to other modeling frameworks when using background point data. 
Gradient boosting and extreme gradient boosting had the highest prediction accuracies when combined with background points (74–81% depending on the number of background points) and may provide valuable alternative analyses to traditional techniques for presence-only data that contain complex correlations and interactions. Increasing the precision of these models, while reducing the inherent biases due to data structure, will allow for more informed forest pest management. This is becoming increasingly important, as changes in population and outbreak dynamics and the introduction of invasive species are projected to increase in the coming decades, partially due to global climate change and increased international trade and travel.

Introduction

Insect pests are prominent drivers of forest disturbances that alter ecosystem function, structure, and composition across the landscape (Flower and Gonzalez-Meler, 2015; Yang and Gratton, 2014). Insect pests can have ecological and economic impacts that occasionally exceed those of other disturbances (e.g., wildfire and windthrow) in many landscapes (Dale et al., 2001; Müller et al., 2008). Interactions among disturbance agents often amplify their impacts beyond those of any single agent, particularly under climate change (Bergeron and Leduc, 1998; Raffa et al., 2008). At low severity and/or infrequent and discrete temporal scales, disturbance agents can promote forest heterogeneity, which provides diverse habitats to other species and enhances succession and energy flow (Gandhi et al., 2022). However, at higher population levels they may inhibit ecosystem services (e.g., carbon sequestration, timber values, and wildlife habitat), at least in the short term, and may facilitate invasion by non-native species (Rosenberger et al., 2012).

Negative effects of insect pests on forest health and sustainability have fueled research assessing pest outbreak risk, pest distribution and range fluctuations, and the probability of invasion or spread of non-native species (Aoki et al., 2018; Bright et al., 2020; Cudmore et al., 2010; Gan, 2004; Lantschner et al., 2014; Munro et al., 2021; Senf et al., 2017; Tribe and Cillié, 2004). However, insect pest data, especially for non-native species, are often collected as presence-only data due to the nature of sampling (i.e., collection devices have limited coverage) and the cryptic nature of most insect pests at various life stages. These data are challenging to model, especially when extrapolating beyond the spatial-temporal boundaries of the data. Such issues, including spatial sampling bias, have been discussed for ecological species distribution models (SDMs) (Phillips et al., 2009). Species distribution models fall into two major categories: profile and group discrimination techniques (Wisz and Guisan, 2009). Profile techniques use presence-only data [i.e., ANUCLIM, BIOCLIM, DOMAIN, ecological niche factor analysis (ENFA), HABITAT, and Mahalanobis distance (MD)], but produce coarse predictions and are not as flexible as group discrimination techniques (Busby, 1991; Hirzel et al., 2001; Hirzel et al., 2002; Houlder et al., 2000; Pearce and Boyce, 2006). BIOCLIM is a software package that implements the simplest form of profile techniques. It is widely applied in the literature but is limited in that it does not account for correlations and interactions among predictors, does not handle species discontinuity in space and time, and often underperforms when compared to other profile techniques (limitations and proper application for increased performance are further reviewed in Booth et al., 2014; Booth, 2018; Tsoar et al., 2007).
Other profile methodologies, such as the Genetic Algorithm for Rule-set Prediction (GARP) and MD, have been developed to reduce some of these limitations and have increased model performance (Tsoar et al., 2007). Of the profile methods, GARP has been shown to be the most flexible and accurate (Tsoar et al., 2007); however, Stockman et al. (2006) argue that accuracy may decline for cryptic species. GARP may not perform well in forest pest modeling, given that many species spend a large portion of their life history inside host trees [e.g., bark and woodboring insects (Hain et al., 2011; Ryan and Hurley, 2012; Safranyik and Carroll, 2007)]. GARP, while only requiring presence data (true absences are not considered in the model), is more of an intermediate between profile and group discrimination techniques, as the algorithm resamples the study area to allocate background points.
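
To make the limitation of the simplest profile techniques concrete, the following is a minimal sketch of a rectilinear, percentile-based environmental envelope, the general idea behind BIOCLIM-style profiling. It is not the BIOCLIM software itself; the function names, the predictor names (temp_c, precip_mm), the percentile bounds, and the presence values are all hypothetical and chosen only for illustration.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_envelope(presences, lower_q=5.0, upper_q=95.0):
    """Fit per-predictor (lower, upper) bounds from presence records only.

    presences: list of dicts mapping predictor name -> value (hypothetical data).
    Each predictor is bounded independently of the others.
    """
    names = list(presences[0].keys())
    return {
        name: (
            percentile([p[name] for p in presences], lower_q),
            percentile([p[name] for p in presences], upper_q),
        )
        for name in names
    }


def in_envelope(site, envelope):
    """True when every predictor at the site falls inside its own bounds."""
    return all(lo <= site[name] <= hi for name, (lo, hi) in envelope.items())


# Hypothetical presences in which temperature and precipitation rise together.
presences = [
    {"temp_c": 10, "precip_mm": 800},
    {"temp_c": 12, "precip_mm": 900},
    {"temp_c": 14, "precip_mm": 1000},
    {"temp_c": 16, "precip_mm": 1100},
    {"temp_c": 18, "precip_mm": 1200},
]
envelope = fit_envelope(presences)

# A cool but wet site: a combination never observed among the presences,
# yet each value lies inside its own marginal bounds.
cool_wet_site = {"temp_c": 11, "precip_mm": 1150}
hot_site = {"temp_c": 20, "precip_mm": 1000}

print(in_envelope(cool_wet_site, envelope))  # True
print(in_envelope(hot_site, envelope))       # False
```

Because each predictor is bounded separately, the cool but wet site is classified as suitable even though no presence record combines those conditions. This is one way to see why an envelope of this kind cannot represent correlations or interactions among predictors.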

Group discrimination techniques contrast presences with background points (i.e., pseudo-absences) using traditional modeling techniques (e.g., generalized linear models and generalized additive models) and/or newer machine learning algorithms (e.g., maximum entropy, random forest, and gradient boosted decision trees) (Wisz and Guisan, 2009). Generalized linear models and generalized additive models are the most widely used applications but require a priori hypotheses on the possible relationships and interactions (Hastie and Tibshirani, 1990; Lütolf et al., 2006; McCullagh and Nelder, 1989; Pearce and Boyce, 2006). Machine learning techniques are relatively new as compared to the above-mentioned techniques, and their flexible nature, ease of handling big data, and ability to explore complex interactions make them attractive (Evans et al., 2011). Further, they do not require a priori hypotheses on the relationship between variables, as traditional modeling approaches do, which allows for exploration of non-intuitive relationships (Evans et al., 2011). Background points can be selected: 1) randomly without replacement from the study area or known species range, without considering presence locations in the data; 2) randomly with the exclusion of presence locations; 3) using geographic weighting and/or exclusion of locations that are environmentally similar to presences; 4) by selecting locations where similar species were recorded but the target species was not; or 5) through a combination of these approaches (Elith and Leathwick, 2007; Hirzel et al., 2001; Senay et al., 2013; Stockwell, 1999; VanDerWal et al., 2009; Zaniewski et al., 2002). Both the methodology used to select background points and the quantity of background points influence model performance, prediction, and extrapolation beyond the data, either spatially or temporally (VanDerWal et al., 2009).
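
The first two selection strategies listed above, combined with background-to-presence ratios of one, two, three, and ten as compared in this study, can be sketched as follows. This is an illustrative sketch only, not the authors' procedure: the grid of candidate cells, the presence locations, the function name sample_background, and the seeds are all hypothetical.

```python
import random


def sample_background(candidate_cells, presence_cells, multiplier,
                      exclude_presences=True, seed=0):
    """Draw background points at random, without replacement.

    candidate_cells: all locations in the study area (hypothetical grid cells).
    presence_cells: locations with a recorded presence.
    multiplier: number of background points per presence (e.g., 1, 2, 3, 10).
    exclude_presences: False mimics strategy 1 (presence locations ignored),
        True mimics strategy 2 (presence locations removed from the pool).
    """
    rng = random.Random(seed)
    presence_set = set(presence_cells)
    if exclude_presences:
        pool = [cell for cell in candidate_cells if cell not in presence_set]
    else:
        pool = list(candidate_cells)
    n_background = multiplier * len(presence_cells)
    if n_background > len(pool):
        raise ValueError("not enough candidate cells for the requested ratio")
    return rng.sample(pool, n_background)


# Hypothetical study area: a 40 x 40 grid with 50 presence cells.
cells = [(x, y) for x in range(40) for y in range(40)]
presences = random.Random(42).sample(cells, 50)

for multiplier in (1, 2, 3, 10):
    background = sample_background(cells, presences, multiplier)
    # Response vector for a presence-background model: 1 = presence, 0 = background.
    labels = [1] * len(presences) + [0] * len(background)
    print(multiplier, len(background), len(labels))
```

Strategies 3 to 5 would replace the uniform draw with a weighted or filtered one, which requires environmental or multi-species data not represented in this sketch.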

Species distribution models used in forest health are crucial for determining suitable habitat for pests because this allows limited detection and management resources to be utilized more efficiently. Species distribution models have primarily focused on maximum entropy (e.g., MaxEnt) and GARP models and have had varying success (Barredo et al., 2015; Fischbein et al., 2019; Sobek-Swant et al., 2012; Srivastava et al., 2020; Venette, 2017). Data used in these models are often obtained through historical and museum records, field collections using insect traps baited with attractive lure combinations (e.g., host odors and/or insect pheromone components), eradication effort data (proxy for establishment), satellite data, or a combination of these sources (Dang et al., 2021; Epanchin-Niell et al., 2021; Munro et al., 2021; Ning et al., 2021; Sarikaya and Sen, 2020). Few studies have attempted to compare the different types of species distribution models (i.e., traditional modeling techniques with newer machine learning algorithms) used for cryptic forest insect species, nor have they thoroughly assessed more than two modeling techniques based on predictive accuracy. Life history has been shown to be an important factor for other cryptic wildlife (Aubry et al., 2017) and, in general, forest insects are even more elusive than those previously tested because many spend a portion of their life cycle inside host trees (e.g., bark and woodboring insects) (Fettig and Audley, 2021; Munro et al., 2019).

The goal of this study was, therefore, to assess group discrimination modeling techniques in relation to forest insect pest prediction models. As such, we used Sirex woodwasp (Sirex noctilio Fabricius) (Hymenoptera: Siricidae) presence-absence data collected in Chile. Sirex woodwasp is not a major pest throughout its native range in Asia, Europe, and northern Africa, but causes significant damage to pine trees (Pinus spp.) in newly invaded areas, such as Australia, New Zealand, and South America (i.e., Argentina, Brazil, Chile, and Uruguay) (Ayres et al., 2014; Ciesla, 2003). Specifically, our objectives were to: 1) compare traditional modeling techniques used in species distribution models (i.e., generalized linear models and generalized additive models) with newer machine learning algorithms (i.e., MaxEnt, random forest, gradient boosted decision trees, and extreme gradient boosting); 2) compare the model accuracy of true presence-absence and presence-background point data; 3) assess model performance with varying background point numbers; and 4) compare prediction maps and uncertainty of Sirex woodwasp using the aforementioned techniques with regard to prediction accuracy (one model for the true presence-absence data and one for the presence-background point data). Prior research assessing habitat suitability generally discusses model uncertainty, but uncertainty maps are rarely presented even though uncertainty varies in space and time. Gaining knowledge on the spatial-temporal upper and lower bounds of predictions would greatly enhance our understanding of habitat suitability. Results from this study will help guide future research assessing forest suitability for insect pests, which is especially beneficial for modeling potential distribution and establishment of non-native species.

Section snippets

Data collection

Sirex woodwasp data were collected from Monterey pine (Pinus radiata D. Don) plantations in Chile during April and May of 2013. Data included 2405 unique 250 m² plots (Supplementary file 1) established in 432 plantation stands within 101 farms in the following geographic limits: 667,580 (minimum X), 789,165 (maximum X), 5,770,425 (minimum Y), and 5,959,550 (maximum Y) (EPSG:32718). All trees inside of the plots were characterized for diameter, height, and the presence or absence of S. noctilio

True presence-absence model comparisons

Our study investigated the performance of six different species distribution modeling techniques. These analyses are particularly important for assessing native pest outbreaks and invasive species distributions, two phenomena that have increased in recent decades and have caused extensive tree mortality (Kurz et al., 2008; Srivastava et al., 2020; Srivastava et al., 2021). These impacts are projected to continue increasing due to future global climate change (Kurz et al., 2008). In this work,

Concluding remarks

We identify three key gaps in knowledge that may be explored within forest ecological modeling. Firstly, sampling bias that is introduced due to presence-only modeling can lead to reduced model accuracy (Ranc et al., 2017), and we found that the inclusion of background points led to more conservative results. These effects may be further amplified by the cryptic nature of forest insect pests. While sampling bias was not specifically explored in this study, model performance for some models was

Funding

This work was supported by the Plantation Management Research Cooperative (PMRC) and Warnell School of Forestry and Natural Resources (University of Georgia).

Declaration of Competing Interest

None.

Acknowledgements

We would like to thank Bioforest, the research branch from ARAUCO company (Chile, South America) for providing Sirex woodwasp data. We also thank the members of the PMRC and Gandhi Forest Entomology Laboratory (University of Georgia) for project feedback.

References (111)

  • H. Hong et al.

    Exploring the effects of the design and quantity of absence data on the performance of random forest-based landslide susceptibility mapping

    Catena

    (2019)
  • M.F. Hutchinson et al.

    Splines - more than just a smooth interpolator

    Geoderma

    (1994)
  • E. Moreno-Amat et al.

    Impact of model complexity on cross-temporal transferability in Maxent species distribution models: an assessment using paleobotanical data

    Ecol. Model.

    (2015)
  • H.L. Munro et al.

    Through space and time: predicting numbers of an eruptive pine tree pest and its predator under changing climate conditions

    For. Ecol. Manag.

    (2021)
  • Y.-S. Park et al.

    Hazard ratings of pine forests to a pine wilt disease at two spatial scales (individual trees and stands) using self-organizing map and random forest

    Ecol. Inform.

    (2013)
  • S. Sobek-Swant et al.

    Potential distribution of emerald ash borer: what can we learn from ecological niche models using Maxent and GARP?

    For. Ecol. Manag.

    (2012)
  • P. Angerer et al.

    repr: Serializable representations. R package version 1.1.3

    (2021)
  • K.B. Aubry et al.

    The importance of data quality for generating reliable distribution models for rare, elusive, and cryptic species

    PLoS One

    (2017)
  • M.P. Ayres et al.

    Host use patterns by the European woodwasp, Sirex noctilio, in its native and invaded range

    PLoS One

    (2014)
  • M. Barbet-Massin et al.

    Selecting pseudo-absences for species distribution models: how, where and how many?

    Methods Ecol. Evol.

    (2012)
  • J.I. Barredo et al.

    Assessing the potential distribution of insect pests: case studies on large pine weevil (Hylobius abietis L) and horse-chestnut leaf miner (Cameraria ohridella) under present and future climate conditions in European forests

    EPPO Bull.

    (2015)
  • Y. Bergeron et al.

    Relationships between change in fire frequency and mortality due to spruce budworm outbreak in the southeastern Canadian boreal forest

    J. Veg. Sci.

    (1998)
  • R. Bivand et al.

    rgdal: Bindings for the “Geospatial” data abstraction library. R package version 1.4–8

    (2019)
  • T.H. Booth et al.

    BIOCLIM: the first species distribution modelling package, its early applications and relevance to most current MAXENT studies

    Divers. Distrib.

    (2014)
  • Y. Boulanger et al.

    Model-specification uncertainty in future forest pest outbreak

    Glob. Chang. Biol.

    (2016)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • B.C. Bright et al.

    Mapping multiple insect outbreaks across large regions annually using Landsat time series data

    Remote Sens.

    (2020)
  • J.R. Busby

    BIOCLIM-A bioclimate analysis and prediction system

    Plant Protect. Quart.

    (1991)
  • T. Chen et al.

    Xgboost: extreme gradient boosting. R package version 0.4–2

    (2015)
  • W.M. Ciesla

    European woodwasp: a potential threat to North America’s conifer forests

    J. For.

    (2003)
  • T.J. Cudmore et al.

    Climate change and range expansion of an aggressive bark beetle: evidence of higher beetle reproduction in naïve host tree populations

    J. Appl. Ecol.

    (2010)
  • D.R. Cutler et al.

    Random forests for classification in ecology

    Ecology

    (2007)
  • V.H. Dale et al.

    Climate change and forest disturbances: climate change can affect forests by altering the frequency, intensity, duration, and timing of fire, drought, introduced species, insect and pathogen outbreaks, hurricanes, windstorms, ice storms, or landslides

    BioScience

    (2001)
  • Y.-Q. Dang et al.

    Retrospective analysis of factors affecting the distribution of an invasive wood-boring insect using native range data: the importance of host plants

    J. Pest. Sci.

    (2021)
  • K.J. Dodds et al.

    The impact of Sirex noctilio in Pinus resinosa and Pinus sylvestris stands in New York and Ontario

    Can. J. For. Res.

    (2010)
  • J. Elith et al.

    Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines

    Divers. Distrib.

    (2007)
  • J.H. Elith et al.

    Novel methods improve prediction of species’ distributions from occurrence data

    Ecography

    (2006)
  • J. Elith et al.

    A statistical explanation of MaxEnt for ecologists

    Divers. Distrib.

    (2011)
  • R. Epanchin-Niell et al.

    Socio-environmental drivers of establishment of Lymantria dispar, a nonnative forest pest, in the United States

    Biol. Invasions

    (2021)
  • J.S. Evans et al.

    Modeling species distribution and change using random forest

  • S. Ferrier et al.

    An evaluation of the effectiveness of environmental surrogates and modelling techniques in predicting the distribution of biological diversity

  • S.E. Fick et al.

    WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas

    Int. J. Climatol.

    (2017)
  • C.E. Flower et al.

    Responses of temperate forest productivity to insect and pathogen disturbances

    Annu. Rev. Plant Biol.

    (2015)
  • J.H. Friedman

    Greedy function approximation: a gradient boosting machine

    Ann. Stat.

    (2001)
  • J. Friedman et al.

    Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)

    Ann. Stat.

    (2000)
  • J. Friedman et al.

    Regularization paths for generalized linear models via coordinate descent

    J. Stat. Softw.

    (2010)
  • K.J.K. Gandhi et al.

    Bark beetle outbreaks alter biotic components of forested landscapes

  • F.P. Hain et al.

    Natural history of the southern pine beetle

  • T.J. Hastie et al.

    Generalized Additive Models

    (1990)
  • T. Hastie et al.

    The Elements of Statistical Learning: Data Mining, Inference, and Prediction

    (2009)