Classification and regression with random forests as a standard method for presence-only data SDMs: A future conservation example using China tree species
Introduction
Species distribution models (SDMs) have been increasingly applied to tackle a wide range of questions in ecology, evolution, biogeography, forestry, climate change, and conservation biology. For instance, they have been used to quantify the environmental niche or niche shifts of species (e.g. Baltensperger et al., 2017; Han et al., 2018; Petitpierre et al., 2012), to identify hierarchies of environmental drivers (e.g. Crimmins et al., 2011; Zhang et al., 2016), and to generate and test biogeographical and ecological hypotheses (e.g. Patsiou et al., 2014; Zhang et al., 2016). SDMs have also been used to inform the prospective design of surveys for rare species (e.g. Sarre et al., 2013) and to map suitable sites for species recovery and reintroduction (e.g. Bleyhl et al., 2015; Kandel et al., 2015). Others have used SDMs to assess the effects of environmental changes on species distribution (e.g. Nenzén and Araújo, 2011; see https://www.fs.fed.us/nrs/atlas/ for large-scale application), to model species assemblages (biodiversity) by stacking individual species predictions (e.g. Cao et al., 2013; Cooper and Soberón, 2018; Gavish et al., 2017), or to support landscape management (Drew et al., 2011). Booth et al. (2014) and Guillera-Arroita et al. (2015) reviewed and detailed the current use of SDMs, whereas the pure data mining concept is still widely ignored (but see work by Craig and Huettmann, 2009; Huettmann and Ickert-Bond, 2017).
Machine learning is a growing and leading approach to a wide range of modern analysis problems, including classification, regression, data mining, and prediction (Mueller and Massaron, 2016). It has developed into a research field of its own, with an extensive literature and code base. Owing to this magnitude and depth, however, machine learning still lags behind in natural resource management applications (but see Cushman and Huettmann, 2010; Drew et al., 2011).
Random forests (RF), one of the most popular SDM algorithms in those applications (Cutler et al., 2007; Oppel et al., 2012; Prasad et al., 2006), is a deep analysis platform (Breiman, 2001a, 2001b) and has the flexibility needed to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning (Breiman, 2001a; Breiman and Cutler, 2004; Liaw and Wiener, 2002). It is a major ‘learner’ and classifier in machine learning (Fernández-Delgado et al., 2014; Hastie et al., 2009). Among other concepts used to obtain the best bagged-tree output, the model is based on unpruned CARTs (Breiman et al., 1984), which can be categorized as classification (and regression) trees (Strobl et al., 2009). This type of tree explains the variation of a single response variable by recursively splitting the data in a binary fashion into progressively more homogeneous groups, using one or more explanatory variables (Breiman, 2001a; Breiman and Cutler, 2004; Liaw and Wiener, 2002). The response variable is usually either numeric (regression trees) or categorical (classification trees), and the explanatory variables can be numeric and/or categorical (De'ath and Fabricius, 2000). Owing to its hierarchical nature, an RF is capable of capturing non-linear and correlated relationships among predictor variables. It can be particularly useful for inference from complex data based on predictions (Breiman, 2001b). A major application is species distribution modeling, where predictor variables often exhibit complex interactions. Owing to bagging, the Law of Large Numbers (Feller, 1968) shows that RF does not tend to overfit data (Breiman, 2001a, 2001b). An RF model is expected to balance the accuracy and robustness of predictions because it inherently incorporates the concept of bagging and ensemble learning (Hastie et al., 2009). Therefore, RF models often perform better than other SDMs (e.g. Cutler et al., 2007; Jafariana et al., 2019; Mi et al., 2017; Peters et al., 2007; Prasad et al., 2006; Zhang et al., 2016). Inference is drawn from the prediction (Breiman, 2001a, 2001b). Based on its underlying classification tree (CT) and regression tree (RT) algorithms, RF can produce categorical (classified, binary) and numerical predictions, respectively. RF model applications in the form of CT and RT models are commonly and successfully used in species distribution modeling with species' presence/absence data (Cutler et al., 2007; Mi et al., 2017; Peters et al., 2007; Zhang et al., 2014). Statistically, CT can also produce numerical predictions (class probability, better termed the relative index of occurrence of a species; Liaw and Wiener, 2002). However, researchers have often focused on the application of either CTs (e.g. Mi et al., 2017; Peters et al., 2007) or RTs (e.g. Gavish et al., 2017; Kandel et al., 2015; Nenzén and Araújo, 2011; Zhang et al., 2016) in an SDM context. The difference in model performance between CTs and RTs therefore remains unclear. It is also worth noting that the RF code originates from Breiman (2001a) and was subsequently released and implemented in several versions. However, the regression solution was not really part of Leo Breiman's initial code and is mostly based on Andy Liaw's implementation in R (Liaw and Wiener, 2002).
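The distinction between CT and RT outputs can be illustrated with a minimal sketch in plain Python. It assumes that a classification forest reports the share of trees voting for presence as its class probability, and that a regression forest fitted to a 0/1 response reports the average of the per-tree numeric outputs. The function names and the per-tree values are hypothetical and are not taken from the study or from any package; ties in the majority vote are simply resolved as absence here.

```python
from statistics import mean


def ct_class_probability(tree_votes):
    """Classification forest: share of trees that vote 'presence' (1)."""
    return sum(tree_votes) / len(tree_votes)


def ct_class(tree_votes):
    """Classification forest: majority vote (ties resolved as absence here)."""
    return 1 if ct_class_probability(tree_votes) > 0.5 else 0


def rt_prediction(tree_outputs):
    """Regression forest: average of the numeric outputs of all trees."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for a single site.
votes = [1, 1, 0, 1, 0]                  # CT: each tree casts a class vote
leaf_means = [0.9, 0.7, 0.2, 0.8, 0.4]   # RT: each tree returns a leaf mean

ct_index = ct_class_probability(votes)   # numerical CT output
ct_binary = ct_class(votes)              # categorical CT output
rt_index = rt_prediction(leaf_means)     # numerical RT output
```

Both forests therefore yield an index between 0 and 1 for a 0/1 response, but the index is built from votes in one case and from averaged numeric values in the other, which is one reason the two may behave differently under a given evaluation criterion.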
SDMs are usually constructed through a series of methods that relate a set of environmental predictors to information on species distributions (Drew et al., 2011; Guisan and Zimmermann, 2000). Information about the distributions of species, frequently housed in museum and herbarium collections, atlases, plant lists, or provided by volunteer observation networks (citizen science), is becoming increasingly available over the internet (Graham et al., 2004; Huettmann and Ickert-Bond, 2018). These data sets are typically composed of ‘presence-only’ (i.e. no information is usually available on the absence of most species), presence–absence or abundance data (Mateo et al., 2010). Accordingly, SDMs can be categorized into two groups: models that only need presence data (profile techniques) vs. those that require both presence and absence data or that require abundance data (group discrimination techniques; Mateo et al., 2010). Data mining is useful for either approach, as it is known to distinguish presence from absence, or from random points, well. Data mining, using machine learning and specifically RF, is the method of choice to find ‘a signal’, any signal as well as outliers, in data (e.g. Fernández-Delgado et al., 2014; Mi et al., 2017). Presence-only SDMs are more likely to yield potential distributions or the fundamental niche information for a species, whereas presence–absence SDMs are more likely to reflect the natural distribution or realized niche of a species (Zaniewski et al., 2002). Either way, such concepts most often yield the best-possible solution when high-powered algorithms are employed (Elith et al., 2006; Fernández-Delgado et al., 2014). Comparisons of various SDMs indicate that presence–absence models tend to perform better than presence-only models (Elith et al., 2006; Mateo et al., 2010; Oppel et al., 2012).
Thus, presence–absence models are increasingly used when only presence data are available, by creating artificial absence data (i.e. pseudo-absence data; Zaniewski et al., 2002; Mateo et al., 2010; Barbet-Massin et al., 2012). Several studies have suggested that pseudo-absence data should be restricted to locations that are documented to be distinctly unsuitable for the occurrence of a particular species (Mateo et al., 2010; Zaniewski et al., 2002). But once more, in real-world data mining, those are smaller differences as the latter methods can find the signal in a rather reliable fashion (e.g. Craig and Huettmann, 2009).
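As an illustration of restricting pseudo-absences to presumably unsuitable locations, the following sketch draws them only from candidate sites that fall outside the environmental range spanned by the presence records. This is a simplified, envelope-style rule assumed here for illustration and is not necessarily the procedure used in this study; the function names, the two climate variables, and all values are hypothetical.

```python
import random


def envelope(presences):
    """Per-variable (min, max) range spanned by the presence records."""
    n_vars = len(presences[0])
    return [
        (min(p[i] for p in presences), max(p[i] for p in presences))
        for i in range(n_vars)
    ]


def outside_envelope(point, env):
    """True if at least one variable lies outside the presence range."""
    return any(not (lo <= v <= hi) for v, (lo, hi) in zip(point, env))


def sample_pseudo_absences(presences, candidates, n, seed=0):
    """Randomly draw up to n candidates lying outside the presence envelope."""
    env = envelope(presences)
    pool = [c for c in candidates if outside_envelope(c, env)]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(pool, min(n, len(pool)))


# Hypothetical records: (mean annual temperature in deg C, annual precipitation in mm).
presences = [(12.0, 900.0), (14.5, 1100.0), (13.2, 1000.0)]
candidates = [
    (13.0, 950.0),   # inside the envelope, never selected
    (2.0, 300.0),
    (25.0, 1800.0),
    (14.0, 400.0),   # temperature fits, precipitation does not
    (12.5, 1050.0),  # inside the envelope, never selected
]

pseudo_absences = sample_pseudo_absences(presences, candidates, n=2)
```

A rectangular envelope is a deliberately crude notion of unsuitability; stricter or looser rules (for example, a minimum geographic distance from presences) would change which candidates enter the pool.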
When developing SDMs, presence–absence data (response variables) are often treated as numeric (usually Y = 1 for presence and Y = 0 for absence) or categorical (binary value) variables. Correspondingly, they result in model prediction outputs that provide either a value of the relative occurrence index scaled from 0 to 1, or a binary value representing presence and absence (see the detailed description in the BIOMOD manual, Thuiller et al., 2009). However, in resource management, climate change and environmental conservation applications (e.g. reserve design, biodiversity assessment, climate change), information that is presented in a binary format such as species presence/absence may have more practical applications than that presented as a continuous index (Baltensperger et al., 2017; Fernandes et al., 2018; Kandel et al., 2015). Therefore, a threshold is needed to convert continuous indexes to binary presence–absence predictions. Furthermore, many commonly used performance measures such as the true skill statistic (TSS) and Kappa require binary data (Fielding and Bell, 1997; Pearce and Ferrier, 2000). Although many threshold selection methods exist for presence/absence data (Jiménez-Valverde and Lobo, 2007; Liu et al., 2005; Nenzén and Araújo, 2011), very few methods have yet been proposed for use with presence-only data (Liu et al., 2013; Liu et al., 2016), and it is somewhat unclear which threshold method is most appropriate for CTs and RTs.
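The conversion step and the two binary measures named above can be sketched as follows. The example assumes the standard definitions TSS = sensitivity + specificity - 1 and Cohen's Kappa, and it picks the threshold that maximizes TSS, which is only one of several possible criteria and not necessarily the one recommended in this study. The observations and index values are hypothetical.

```python
def confusion(observed, index, threshold):
    """Counts (tp, fp, tn, fn) after converting the index to 0/1."""
    tp = fp = tn = fn = 0
    for obs, value in zip(observed, index):
        pred = 1 if value >= threshold else 0
        if obs == 1 and pred == 1:
            tp += 1
        elif obs == 0 and pred == 1:
            fp += 1
        elif obs == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def tss(tp, fp, tn, fn):
    """True skill statistic: sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def kappa(tp, fp, tn, fn):
    """Cohen's Kappa: agreement corrected for chance agreement."""
    n = tp + fp + tn + fn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


def max_tss_threshold(observed, index):
    """Smallest predicted value that maximizes TSS when used as threshold."""
    candidates = sorted(set(index))
    return max(candidates, key=lambda t: tss(*confusion(observed, index, t)))


# Hypothetical observations (1 = presence, 0 = pseudo-absence) and model index.
observed = [1, 1, 1, 0, 0, 0, 1, 0]
index = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

counts_default = confusion(observed, index, 0.5)   # fixed 0.5 threshold
tss_default = tss(*counts_default)
kappa_default = kappa(*counts_default)

best = max_tss_threshold(observed, index)
tss_best = tss(*confusion(observed, index, best))
```

In this toy data set the fixed threshold of 0.5 is not the TSS-optimal one, which illustrates why the choice of threshold method can alter the resulting binary map and its apparent accuracy.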
In view of the above, we raise the following questions about CT and RT predictions for valid inference: Which approach works best for predicting species distributions with presence and pseudo-absence data, and which threshold is most appropriate if binary conversion is wanted or necessary?
While there are many implementations of the RF base code (e.g. in R, Python, Fortran, SPM software of Salford Systems Ltd) with different strengths and weaknesses (Liaw and Wiener, 2002; Strobl et al., 2009; Ishwaran and Kogalur, 2007; Brieuc et al., 2018; see Herrick, 2013 for an assessment), here we chose the R package “randomForest” (Liaw and Wiener, 2002) to construct CT and RT prediction models. We did so because it is particularly popular in SDM work, is open access and open source, and because R has a large library of statistical packages relevant to SDMs for data preprocessing and postprocessing (e.g. Freeman and Moisen, 2008a; Hijmans, 2012; Thuiller et al., 2009) that can easily be linked.
Under an ensemble modeling framework, 52 forest tree species from China were selected to compare the performance of CT and RT algorithms in projecting the distribution and potential range shifts of these species under current and future climates. This matters because, in China, concern about environmental protection and forest resource conservation has prompted the rapid development of tree plantations (Bryan et al., 2018; Li, 2004). The area of forest plantations in China currently stands at 69 million hectares, or 24.8% of the total area worldwide (277.9 million hectares), which is well ahead of any other country (FAO, 2015). Moreover, in joining the international efforts to mitigate global climate change, China has set a target to increase its forest carbon sink by expanding forest cover as a key measure in the forestry sector (26% forest cover by 2050, and 40 million new hectares by 2020 when compared to 2005 levels; SFAC, 2010). To achieve the goal of expanding forest cover, large areas of new plantations need to be established. China has a land area of 9.6 million square kilometers and spans a large range of climates and natural environments (Song and Zhang, 2010). The identification of climate requirements and predictions of potential range shifts of native tree species under an altered climate will greatly facilitate the success of establishing new plantation forests. We also anticipate that this study will provide a scientific basis for the choice of species and sites for large-scale forestation practice. Finally, these steps are facilitated by randomForest, and here we assess whether that concept and its workflow can stand as a generic template.
Materials and methods
Fig. 1 shows the overall workflow for our ensemble forecasting approach. It is meant to be a generic concept applicable to virtually any model prediction question with presence-only data.
Differences in model accuracy for numerical predictions
We found that the relative performance of CT and RT depended on the choice of evaluation criteria (Fig. 2). For the 2-degree method, Wilcoxon signed-rank tests indicated that AUC values of ROC, OA, sensitivity, and specificity were significantly higher for RT than for CT. When MAE was used to assess model performance, CT performed better than RT. CT and RT performed equally well with respect to RMSE, R2, and MXE. For the SRE method, RT performed better than CT in terms of RMSE, R2, MXE, and AUC value
Discussion
This is the first study to compare the model performance of RT and CT using different thresholding methods. Our model predictions have good to very good accuracy and allow questions of modern ecological applications to be tackled, e.g. selection of sites and species for reforestation with future climate change in mind, and overall tree species conservation. Further, we applied one of the best algorithms (RF) known for SDMs at a national level in open source code with open access data. As
Conclusions
In conclusion, randomForest can perform as a leading prediction algorithm when applied to multiple species at a national level. However, in practice we argue for choosing RT rather than CT as the SDM if model discrimination capacity is viewed as more important than model reliability, and vice versa. In line with gradient theory, we recommend the use of probabilistic predictions of RT or CT for species distribution modeling. A binary conversion of model outputs should only be implemented when it
Acknowledgements
This study was funded by the National Key R&D Program of China (2017YFC0505501, 2017YFC0505603) and National Natural Science Foundation of China (41301056).
References (74)
- et al. (2007). Ensemble forecasting of species distributions. Trends Ecol. Evol.
- et al. (2015). Mapping seasonal European bison habitat in the Caucasus Mountains to identify potential reintroduction sites. Biol. Conserv.
- et al. (2013). Using Maxent to model the historic distributions of stonefly species in Illinois streams: the effects of regularization and threshold selections. Ecol. Model.
- et al. (2018). How much should one sample to accurately predict the distribution of species assemblages? A virtual community approach. Ecol. Inform.
- et al. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and Kappa. Ecol. Model.
- et al. (2004). New developments in museum-based informatics and applications in biodiversity analysis. Trends Ecol. Evol.
- et al. (2000). Predictive habitat distribution models in ecology. Ecol. Model.
- et al. (2007). Threshold criteria for conversion of probability of species presence to either-or presence-absence. Acta Oecol.
- et al. (2015). Rapid multi-nation distribution assessment of a charismatic conservation species using open access ensemble model GIS predictions: red panda (Ailurus fulgens) in the Hindu-Kush Himalaya region. Biol. Conserv.
- et al. (2011). Choice of threshold alters projections of species range shifts under climate change. Ecol. Model.