Elsevier

Ecological Informatics

Volume 52, July 2019, Pages 46-56
Classification and regression with random forests as a standard method for presence-only data SDMs: A future conservation example using China tree species

https://doi.org/10.1016/j.ecoinf.2019.05.003

Highlights

  • CT in machine learning was compared with RT for inference from predictions.

  • Choice of evaluation criteria changed the relative performance of CT and RT.

  • Choice of threshold altered model performance and species range shift projections.

  • Four objective threshold methods were recommended for binary predictions.

  • First-time generic guidelines were proposed on how to choose RF (CT or RT) methods.

Abstract

The random forests (RF) algorithm is a superb learner and classifier in machine learning applications. This ensemble model is also one of the most popular algorithms for species distribution models (SDMs) available to date. By default, RF can produce categorical and numerical species distribution maps based on its classification tree (CT) and regression tree (RT) algorithms, respectively. Statistically, CT can also produce numerical predictions (class probabilities). Many real-world applications (e.g. conservation planning) employ binary presence–absence outputs, which require classification thresholds to convert numerical predictions. However, little information is available on the difference in model performance between CT and RT in inference settings. Here, under an ensemble modeling framework, 52 forest tree species with presence-only data for all of China were selected to compare the performance of CT and RT algorithms in projecting the distribution and potential range shifts of these species under current and future climates. Five climatic variables were used to develop CT and RT models. Eight threshold-setting approaches were employed to convert numerical predictions into binary predictions. For probabilistic predictions, the relative performance of CT and RT depended on the choice of evaluation criteria. For both RT and CT, threshold-setting methods significantly altered the thresholds obtained, model performance, and subsequently the projections of species range shifts under climate change. The four threshold selection methods based on composite model accuracy measures (MaxKappa, MaxOA, MaxTSS, and MinROCdist) most often achieved significantly higher model performance than the CT default threshold method and the other threshold methods. They consistently projected changes in species' geographical ranges in response to climate change with the same direction and magnitude.
We argue for choosing RT rather than CT as the SDM if model discrimination capacity (the ability to differentiate between occurrences of presence and absence) is viewed as more important than model reliability (the agreement between predicted relative indexes of occurrence and observed proportions of occurrence), and vice versa. In line with gradient theory, we recommend the use of numerical predictions for species distribution modeling since they convey more information than binary predictions. Binary conversion of model outputs should only be carried out when it is clearly justified by the application's objective. The four aforementioned threshold methods are promising objective methods for binary conversion of continuous predictions when presence-only data are available. This study proposes guidelines on how machine learning can be used for specific applied and theoretical applications in an SDM context.
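The distinction between discrimination and reliability can be illustrated with a minimal sketch. It assumes AUC as the discrimination measure and MAE and RMSE as error-based measures that reward agreement between predicted and observed values; the observations and the two sets of predictions (`model_a`, `model_b`) are hypothetical and are not taken from the study.

```python
from math import sqrt

def auc(obs, pred):
    """AUC as the share of presence/absence pairs ranked correctly (ties count 0.5)."""
    pos = [p for y, p in zip(obs, pred) if y == 1]
    neg = [p for y, p in zip(obs, pred) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mae(obs, pred):
    """Mean absolute error between observed 0/1 values and predicted indexes."""
    return sum(abs(y - p) for y, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean squared error between observed 0/1 values and predicted indexes."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(obs, pred)) / len(obs))

# Hypothetical data: 1 = presence, 0 = (pseudo-)absence.
obs = [1, 1, 1, 0, 0, 0]
# model_a ranks every presence above every absence, but its values sit near 0.5.
model_a = [0.6, 0.6, 0.6, 0.4, 0.4, 0.4]
# model_b is close to the observed values overall, but misranks one presence.
model_b = [0.9, 0.9, 0.2, 0.1, 0.1, 0.3]

for name, pred in (("model_a", model_a), ("model_b", model_b)):
    print(name, round(auc(obs, pred), 3), round(mae(obs, pred), 3), round(rmse(obs, pred), 3))
```

In this toy case `model_a` has the higher AUC while `model_b` has the lower MAE, so which model looks better depends on the evaluation criterion chosen.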

Introduction

Species distribution models (SDMs) have been increasingly applied to tackle a wide range of questions in ecology, evolution, biogeography, forestry, climate change, and conservation biology. For instance, they have been used to quantify the environmental niche or niche shifts of species (e.g. Baltensperger et al., 2017; Han et al., 2018; Petitpierre et al., 2012), to identify hierarchies of environmental drivers (e.g. Crimmins et al., 2011; Zhang et al., 2016), and to generate and test biogeographical and ecological hypotheses (e.g. Patsiou et al., 2014; Zhang et al., 2016). SDMs have also been used to inform the prospective design of surveys for rare species (e.g. Sarre et al., 2013) and to map suitable sites for species recovery and reintroduction (e.g. Bleyhl et al., 2015; Kandel et al., 2015). Others have used SDMs to assess the effects of environmental changes on species distribution (e.g. Nenzén and Araújo, 2011; see https://www.fs.fed.us/nrs/atlas/ for a large-scale application), to model species assemblages (biodiversity) by stacking individual species predictions (e.g. Cao et al., 2013; Cooper and Soberón, 2018; Gavish et al., 2017), or to support landscape management (Drew et al., 2011). Booth et al. (2014) and Guillera-Arroita et al. (2015) reviewed and detailed the current use of SDMs, whereas the pure data mining concept is still widely ignored (but see Craig and Huettmann, 2009; Huettmann and Ickert-Bond, 2017).

Machine learning is a growing and leading approach for a wide range of modern analysis problems, including classification, regression, data mining and prediction (Mueller and Massaron, 2016). It has developed into a research field of its own, with an extensive literature and code base. Owing to its magnitude and depth, machine learning still lags behind, for now, in natural resource management applications (but see Cushman and Huettmann, 2010; Drew et al., 2011).

Random forests (RF), one of the most popular SDM algorithms in those applications (Cutler et al., 2007; Oppel et al., 2012; Prasad et al., 2006), is a deep analysis platform (Breiman, 2001a, 2001b) and has the flexibility needed to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning (Breiman, 2001a; Breiman and Cutler, 2004; Liaw and Wiener, 2002). It is a major ‘learner’ and classifier in machine learning (Fernández-Delgado et al., 2014; Hastie et al., 2009). Alongside other concepts used to obtain the best bagged tree output, the model is based on unpruned CARTs (Breiman et al., 1984), which can be categorized as classification (and regression) trees (Strobl et al., 2009). This type of tree explains the variation of a single response variable by recursively splitting the data in a binary fashion into progressively more homogeneous groups, using one or more explanatory variables (Breiman, 2001a; Breiman and Cutler, 2004; Liaw and Wiener, 2002). The response variable is usually either numeric (regression trees) or categorical (classification trees), and the explanatory variables can be numeric and/or categorical (De'ath and Fabricius, 2000). Owing to its hierarchical nature, an RF is capable of capturing non-linear and correlated relationships in predictor variables. It can be particularly useful for inference from complex data based on predictions (Breiman, 2001b). A major application is species distribution modeling, which often involves complex interactions among predictor variables. Because predictions are aggregated over many bagged trees, the Law of Large Numbers (Feller, 1968) implies that RF does not tend to overfit data (Breiman, 2001a, 2001b). An RF model is expected to balance the accuracy and robustness of predictions because it inherently incorporates the concepts of bagging and ensemble learning (Hastie et al., 2009). Therefore, RF models often perform better than other SDMs (e.g. Cutler et al., 2007; Jafariana et al., 2019; Mi et al., 2017; Peters et al., 2007; Prasad et al., 2006; Zhang et al., 2016). Inference is drawn from the prediction (Breiman, 2001a, 2001b). Based on its underlying classification (CT) and regression (RT) tree algorithms, RF can produce categorical (classified, binary) and numerical predictions, respectively. RF applications in the form of CT and RT models are commonly and successfully used in species distribution modeling with species' presence/absence data (Cutler et al., 2007; Mi et al., 2017; Peters et al., 2007; Zhang et al., 2014). Statistically, CT can also produce numerical predictions (class probability, or better called the species relative index of occurrence; Liaw and Wiener, 2002). However, researchers have often focused on the application of either CTs (e.g. Mi et al., 2017; Peters et al., 2007) or RTs (e.g. Gavish et al., 2017; Kandel et al., 2015; Nenzén and Araújo, 2011; Zhang et al., 2016) in an SDM context. The difference in model performance between CTs and RTs remains unclear. It is also worth noting that the RF code comes from Breiman (2001a) and was subsequently released and implemented in several versions. However, the regression solution was not really part of Leo Breiman's initial code and is mostly based on Andy Liaw's implementation in R (Liaw and Wiener, 2002).
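The difference between the two kinds of numerical output can be sketched as follows. This is a simplified illustration, not the randomForest implementation: it assumes that, for one site, each tree reports the proportion of presences in the leaf the site falls into, and it reuses the same hypothetical leaves for both modes, which real CT and RT forests would not share. In classification mode each tree casts a vote and the score is the fraction of trees voting presence; in regression mode the score is the average of the per-tree leaf means.

```python
def ct_probability(leaf_presence_fractions):
    """Classification mode: each tree votes for the majority class of its leaf;
    the forest score is the fraction of trees voting 'presence'.
    Assumption: a leaf with exactly 0.5 presences votes 'absence'."""
    votes = [1 if frac > 0.5 else 0 for frac in leaf_presence_fractions]
    return sum(votes) / len(votes)

def rt_prediction(leaf_presence_fractions):
    """Regression mode on a 0/1 response: each tree returns its leaf mean;
    the forest score is the average over trees."""
    return sum(leaf_presence_fractions) / len(leaf_presence_fractions)

# Hypothetical leaf presence fractions for one site across five trees.
leaves = [0.6, 0.6, 0.6, 0.4, 1.0]

print("CT vote fraction:", ct_probability(leaves))
print("RT mean:", round(rt_prediction(leaves), 2))
```

Under these assumptions the vote fraction discards how strongly each tree leans toward presence, whereas the regression average keeps it, which is one reason the two scores can rank or calibrate sites differently.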

SDMs are usually constructed through a series of methods that relate a set of environmental predictors to information on species distributions (Drew et al., 2011; Guisan and Zimmermann, 2000). Information about the distributions of species, frequently housed in museum and herbarium collections, atlases, plant lists, or provided by volunteer observation networks (citizen science), is becoming increasingly available over the internet (Graham et al., 2004; Huettmann and Ickert-Bond, 2018). These data sets are typically composed of ‘presence-only’ (i.e. no information is usually available on the absence of most species), presence–absence or abundance data (Mateo et al., 2010). Accordingly, SDMs can be categorized into two groups: models that only need presence data (profile techniques) vs. those that require both presence and absence data or abundance data (group discrimination techniques; Mateo et al., 2010). Data mining is useful for either approach because it is known to separate presence from absence, or from random background points, well. Data mining, using machine learning and specifically RF, is the method of choice to find ‘a signal’, any signal as well as outliers, in data (e.g. Fernández-Delgado et al., 2014; Mi et al., 2017). Presence-only SDMs are more likely to yield potential distributions or fundamental niche information for a species, whereas presence–absence SDMs are more likely to reflect the natural distribution or realized niche of a species (Zaniewski et al., 2002). Either way, such concepts most often yield the best possible solution when high-powered algorithms are employed (Elith et al., 2006; Fernández-Delgado et al., 2014). Comparisons of various SDMs indicate that presence–absence models tend to perform better than presence-only models (Elith et al., 2006; Mateo et al., 2010; Oppel et al., 2012).
Thus, presence–absence models are increasingly used when only presence data are available, by creating artificial absence data (i.e. pseudo-absence data; Zaniewski et al., 2002; Mateo et al., 2010; Barbet-Massin et al., 2012). Several studies have suggested that pseudo-absence data should be restricted to locations that are documented to be distinctly unsuitable for the occurrence of a particular species (Mateo et al., 2010; Zaniewski et al., 2002). But once more, in real-world data mining those differences are small, because machine learning methods can find the signal in a rather reliable fashion (e.g. Craig and Huettmann, 2009).
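One common way to create pseudo-absences can be sketched as below. This is a hedged illustration only: it assumes a distance-based rule in which random candidate points are kept only if they lie at least a minimum distance (here 2 degrees, measured as plain Euclidean distance on longitude and latitude) from every presence record. The function name, bounding box, and coordinates are hypothetical, and the rule is a proxy for "distinctly unsuitable" locations rather than the procedure used in the study.

```python
import random

def sample_pseudo_absences(presences, bbox, n, min_dist_deg=2.0, seed=0, max_tries=100000):
    """Draw up to n random points inside bbox that are at least min_dist_deg
    away (Euclidean distance in degrees) from every presence point."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lon_min, lon_max, lat_min, lat_max = bbox
    accepted = []
    tries = 0
    while len(accepted) < n and tries < max_tries:
        tries += 1
        lon = rng.uniform(lon_min, lon_max)
        lat = rng.uniform(lat_min, lat_max)
        far_enough = all(
            (lon - plon) ** 2 + (lat - plat) ** 2 >= min_dist_deg ** 2
            for plon, plat in presences
        )
        if far_enough:
            accepted.append((lon, lat))
    return accepted

# Hypothetical presence records (longitude, latitude) and study extent.
presences = [(110.0, 30.0), (112.0, 31.0)]
bbox = (100.0, 120.0, 20.0, 40.0)  # lon_min, lon_max, lat_min, lat_max
pseudo_absences = sample_pseudo_absences(presences, bbox, n=50)
print(len(pseudo_absences), "pseudo-absence points drawn")
```

A real workflow would typically sample from the cells of an environmental raster and use geodesic distances; the rejection-sampling structure stays the same.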

When developing SDMs, presence–absence data (response variables) are often treated as numeric (usually Y = 1 for presence and Y = 0 for absence) or categorical (binary value) variables. Correspondingly, they result in model prediction outputs that provide either a value of the relative occurrence index scaled from 0 to 1, or a binary value representing presence or absence (see the detailed description in the BIOMOD manual, Thuiller et al., 2009). However, in resource management, climate change and environmental conservation applications (e.g. reserve design, biodiversity assessment), information that is presented in a binary format such as species presence/absence may have more practical applications than that presented as a continuous index (Baltensperger et al., 2017; Fernandes et al., 2018; Kandel et al., 2015). Therefore, a threshold is needed to convert continuous indexes to binary presence–absence predictions. Furthermore, many commonly used performance measures such as the true skill statistic (TSS) and Kappa require binary data (Fielding and Bell, 1997; Pearce and Ferrier, 2000). Although many threshold selection methods exist for presence/absence data (Jiménez-Valverde and Lobo, 2007; Liu et al., 2005; Nenzén and Araújo, 2011), very few methods have yet been proposed for use with presence-only data (Liu et al., 2013; Liu et al., 2016), and it is somewhat unclear which threshold method is most appropriate for CTs and RTs.
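Threshold selection based on accuracy measures can be sketched as follows. The sketch assumes the usual definitions: overall accuracy (OA), TSS = sensitivity + specificity - 1, Cohen's Kappa, and the distance from the ROC point to the perfect corner (0, 1). Each rule then picks the candidate threshold that maximizes (or, for the ROC distance, minimizes) its measure. The data are hypothetical, with pseudo-absences coded as 0, and ties are resolved in favor of the first candidate.

```python
from math import sqrt

def confusion(obs, pred, threshold):
    """Count TP, FP, FN, TN when predictions >= threshold are called presences."""
    tp = fp = fn = tn = 0
    for y, p in zip(obs, pred):
        called_present = p >= threshold
        if y == 1 and called_present:
            tp += 1
        elif y == 0 and called_present:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def scores(obs, pred, threshold):
    """Accuracy measures for the binary map obtained at one threshold."""
    tp, fp, fn, tn = confusion(obs, pred, threshold)
    n = tp + fp + fn + tn
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    oa = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's Kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - expected) / (1 - expected) if expected < 1 else 0.0
    return {
        "OA": oa,
        "TSS": sens + spec - 1,
        "Kappa": kappa,
        "ROCdist": sqrt((1 - sens) ** 2 + (1 - spec) ** 2),
    }

def pick_thresholds(obs, pred, candidates):
    """Return the candidate threshold chosen by each of the four rules."""
    table = {t: scores(obs, pred, t) for t in candidates}
    return {
        "MaxOA": max(candidates, key=lambda t: table[t]["OA"]),
        "MaxTSS": max(candidates, key=lambda t: table[t]["TSS"]),
        "MaxKappa": max(candidates, key=lambda t: table[t]["Kappa"]),
        "MinROCdist": min(candidates, key=lambda t: table[t]["ROCdist"]),
    }

# Hypothetical data: 1 = presence, 0 = pseudo-absence, with predicted indexes.
obs = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [0.9, 0.8, 0.6, 0.4, 0.5, 0.3, 0.3, 0.2, 0.1, 0.1]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

chosen = pick_thresholds(obs, pred, candidates)
print(chosen)
```

In practice the candidate grid would be finer and the measures would be computed on held-out data; with presence-only data the "absences" are pseudo-absences, so the resulting specificity describes separation from background points rather than from confirmed absences.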

In view of the above, we raise the following questions about CT and RT predictions for valid inference: which approach works best to predict species distributions with presence and pseudo-absence data, and which threshold is most appropriate if binary conversion is wanted or necessary?

While there are many implementations of the RF base code (e.g. in R, Python, Fortran, SPM software of Salford Systems Ltd) with different strengths and weaknesses (Liaw and Wiener, 2002; Strobl et al., 2009; Ishwaran and Kogalur, 2007; Brieuc et al., 2018; see Herrick, 2013 for an assessment), here we chose the R package “randomForest” (Liaw and Wiener, 2002) to construct CT and RT prediction models. We did so because it is particularly popular in SDM work, is open access and open source, and because R has a large library of statistical packages relevant to SDMs for data preprocessing and postprocessing (e.g. Freeman and Moisen, 2008a; Hijmans, 2012; Thuiller et al., 2009) that can easily be linked.

Under an ensemble modeling framework, 52 forest tree species from China were selected to compare the performance of CT and RT algorithms in projecting the distribution and potential range shifts of these species under current and future climates. This matters because, in China, concern about environmental protection and forest resource conservation has prompted the rapid development of tree plantations (Bryan et al., 2018; Li, 2004). Currently, the area of forest plantations in China stands at 69 million hectares, or 24.8% of the total area worldwide (277.9 million hectares), which is well ahead of any other country (FAO, 2015). Moreover, in joining the international efforts to mitigate global climate change, China has set a target to increase its forest carbon sink by expanding forest cover as a key measure in the forestry sector (26% forest cover by 2050, and 40 million new hectares by 2020 when compared to 2005 levels; SFAC, 2010). To achieve the goal of expanding forest cover, large areas of new plantations need to be established. China has a land area of 9.6 million square kilometers and spans a wide range of climates and natural environments (Song and Zhang, 2010). The identification of climate requirements and predictions of potential range shifts of native tree species under an altered climate will greatly facilitate success in establishing new plantation forests. We also anticipate that this study will provide a scientific basis for the choice of species and sites in large-scale forestation practice. Finally, these steps are facilitated by randomForest, and here we assess whether that concept and its workflow can stand as a generic template.

Materials and methods

Fig. 1 shows the overall workflow for our ensemble forecasting approach. It is meant to be a generic concept applicable to virtually any model prediction question with presence-only data.

Differences in model accuracy for numerical predictions

We found that the relative performance of CT and RT depended on the choice of evaluation criteria (Fig. 2). For the 2 degree method, Wilcoxon signed-rank tests indicated that AUC values of ROC, OA, sensitivity, and specificity were significantly higher for RT than for CT. When MAE was used to assess model performance, CT performed better than RT. CT and RT performed equally well with respect to RMSE, R2, and MXE. For the SRE method, RT performed better than CT in terms of RMSE, R2, MXE, and AUC value

Discussion

This is the first study to compare the model performance of RT and CT using different thresholding methods. Our model predictions have good to very good accuracy and allow us to tackle questions in modern ecological applications, e.g. selection of sites and species for reforestation with future climate change in mind, and overall tree species conservation. Further, we applied one of the best algorithms (RF) known for SDMs at a national level, using open source code and open access data. As

Conclusions

In conclusion, randomForest can perform as a leading prediction algorithm when used with multiple species and at a national level. However, in practice we argue for choosing RT rather than CT as the SDM if model discrimination capacity is viewed as more important than model reliability, and vice versa. In line with gradient theory, we recommend the use of probabilistic predictions of RT or CT for species distribution modeling. A binary conversion of model outputs should only be implemented when it

Acknowledgements

This study was funded by the National Key R&D Program of China (2017YFC0505501, 2017YFC0505603) and National Natural Science Foundation of China (41301056).

References (74)

  • S. Oppel et al.

    Comparison of five modelling techniques to predict the spatial distribution and abundance of seabirds

    Biol. Conserv.

    (2012)
  • J. Pearce et al.

    Evaluating the predictive performance of habitat models developed using logistic regression

    Ecol. Model.

    (2000)
  • J. Peters et al.

    Random forests as a tool for ecohydrological distribution modelling

    Ecol. Model.

    (2007)
  • C.J.F. Ter Braak et al.

    A theory of gradient analysis

    Adv. Ecol. Res.

    (1988)
  • A. Zaniewski et al.

    Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns

    Ecol. Model.

    (2002)
  • L. Zhang et al.

    Using DEM to predict Abies faxoniana and Quercus aquifolioides distributions in the upstream catchment basin of the Min River in Southwest China

    Ecol. Indic.

    (2016)
  • O. Allouche et al.

    Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS)

    J. Appl. Ecol.

    (2006)
  • M. Ballings et al.

    AUC: Threshold Independent Performance Measures for Probabilistic Classifiers. R Package Version 0.3.0

  • A.P. Baltensperger et al.

    Expansion of American marten (Martes americana) distribution in response to climate and landscape change on the Kenai peninsula, Alaska

    J. Mammal.

    (2017)
  • M. Barbet-Massin et al.

    Selecting pseudo-absences for species distribution models: how, where and how many?

    Methods Ecol. Evol.

    (2012)
  • T.H. Booth et al.

    BIOCLIM: the first species distribution modelling package, its early applications and relevance to most current MAXENT studies

    Divers. Distrib.

    (2014)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • L. Breiman

    Statistical modeling: the two cultures

    Stat. Sci.

    (2001)
  • L. Breiman et al.

    Random Forests

  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • M.S.O. Brieuc et al.

    A practical introduction to random forest for genetic association studies in ecology and evolution

    Mol. Ecol. Resour.

    (2018)
  • B.A. Bryan et al.

    China's response to a national land-system sustainability emergency

    Nature

    (2018)
  • J.C. Cooper et al.

    Creating individual accessible area hypotheses improves stacked species distribution model performance

    Glob. Ecol. Biogeogr.

    (2018)
  • E. Craig et al.

    Using “blackbox” algorithms such as TreeNet and random forests for data-mining and for finding meaningful patterns, relationships and outliers in complex ecological data: An overview, an example using golden eagle satellite data and an outlook for a promising future

  • S.M. Crimmins et al.

    Changes in climatic water balance drive downhill shifts in plant species' optimum elevations

    Science

    (2011)
  • S.A. Cushman et al.

    Spatial Complexity, Informatics, and Wildlife Conservation

    (2010)
  • D.R. Cutler et al.

    Random forests for classification in ecology

    Ecology

    (2007)
  • G. De'ath et al.

    Classification and regression trees: a powerful yet simple technique for ecological data analysis

    Ecology

    (2000)
  • C.A. Drew et al.

    Predictive Species and Habitat Modeling in Landscape Ecology: Concepts and Applications

    (2011)
  • Editorial Board of Vegetation map of China (EBVMC) et al.

    1:1,000,000 Vegetation Distribution Map of China

    (2001)
  • J. Elith et al.

    Novel methods improve prediction of species' distributions from occurrence data

    Ecography

    (2006)
  • J.S. Evans et al.

    Modeling species distribution and change using random forest
