Use of freely available datasets and machine learning methods in predicting deforestation

https://doi.org/10.1016/j.envsoft.2016.10.006

Highlights

  • Freely available datasets have proven valuable in predicting deforestation.

  • Machine learning techniques are a reliable alternative to statistics.

  • Gaussian processes are suggested as an alternative to artificial neural networks.

  • Bayesian networks were more stable across sampling methods.

Abstract

The range and quality of freely available geo-referenced datasets are increasing. We evaluate the usefulness of free datasets for deforestation prediction by comparing generalised linear models and generalised linear mixed models (GLMMs) with a variety of machine learning models (Bayesian networks, artificial neural networks and Gaussian processes) across two study regions. Freely available datasets were able to generate plausible risk maps of deforestation using all techniques for study zones in both Mexico and Madagascar. Artificial neural networks outperformed GLMMs in the Madagascan study zone (average AUC 0.83 vs 0.80), but not in the Mexican study zone (average AUC 0.81 vs 0.89). In Mexico and Madagascar, Gaussian processes (average AUC 0.89, 0.85) and structured Bayesian networks (average AUC 0.88, 0.82) performed at least as well as GLMMs (average AUC 0.89, 0.80). Bayesian networks produced more stable results across different sampling methods. Gaussian processes performed well (average AUC 0.85) with fewer predictor variables.

Introduction

Forests around the world remain at risk from a range of threats including urban population growth (DeFries et al., 2010), agricultural and infrastructure expansion (Newman et al., 2014), illegal logging (Gaveau et al., 2009) and insecure property rights (Robinson et al., 2014). With the loss of the forests, we are also losing valuable ecosystem services (Rogers et al., 2010) and critical habitats for maintaining biodiversity (Buchanan et al., 2008), and destroying an important carbon sink that could help mitigate increasing atmospheric concentrations of carbon dioxide (Wang et al., 2009). In order to better understand and ultimately reduce these risks, researchers frequently turn to data-driven analyses (Mas et al., 2004, Vaca et al., 2012, Allnutt et al., 2013, Newman et al., 2014), for which access to relevant, high-quality information is crucial.

Despite their value, many datasets, especially at high resolution, still remain difficult or costly to obtain. Socio-economic data may rely on costly surveys, and gaining access to data on dynamic variables, such as city or road locations, for the relevant time periods (i.e. when the deforestation was occurring) can be difficult and may require manual digitisation of maps. In contrast, other geo-referenced datasets, such as those describing land use change (Vaca et al., 2012, Allnutt et al., 2013), protected areas (WDPA, 2010), political boundaries (NE, 2013a) and ecoregions (Olson et al., 2001), are becoming freely available. To date, however, there has been no rigorous assessment of the utility of these freely available datasets for deforestation risk modelling.

To analyse these data, researchers have often relied on classical statistics such as generalised linear models – GLMs (Hastie et al., 2009), and more recently generalised linear mixed models – GLMMs (Green et al., 2013). While these techniques are well accepted and easily implemented, they assume explanatory variables are independent (unless dependencies are explicitly modelled) and cannot exploit nonlinear relationships between dependent and independent variables (unless they are known to be nonlinear a priori and data can be transformed). Machine learning (ML) methods such as artificial neural networks – ANNs (Hastie et al., 2009), Bayesian networks – BNs (Fenton and Neil, 2013) and Gaussian processes – GPs (Rasmussen and Williams, 2006), do not make these assumptions. This may prove advantageous when modelling deforestation risk, where predictor variables may not be independent and relationships may not be linear.

While comparisons of multiple ML and statistical methods have been conducted in assessing landslide susceptibility (Pham et al., 2016), land use change (Tayyebi et al., 2014), and conservation biology (Kampichler et al., 2010), such broad comparisons have not been undertaken for deforestation risk assessment, with studies either offering no model comparison (Mas et al., 2004, Basse et al., 2014) or a limited comparison of only two methods (Pérez-Vega et al., 2012). This study aims to address this gap while at the same time evaluating a variety of relevant, freely available or low cost datasets to determine their usefulness in predicting deforestation risk, defined here as probability of the presence or absence of deforestation.

By using several statistical and machine learning techniques, we assess whether machine learning is able to improve on the more commonly used methods from classical statistics. In doing so we provide researchers with guidance on the comparative performance of these analytical methods in predicting deforestation risk. We first describe the datasets used in this study along with each deforestation risk modelling method compared. We then describe the design and implementation of each modelling method, the predictor variables included and the model evaluation metrics used in this study. Finally, we examine how the ML models compared against standard statistical models and the implications of these results.

Free or low cost datasets are becoming increasingly common and cover a range of factors relevant to analysing deforestation. While efforts are being made to look at methods for improving the quality of land use images in these datasets (Estes et al., 2016), many are already at a standard that is potentially useful for practical deforestation prediction. High levels of correlation amongst variables are common in land use change, with different factors sometimes leading to the same outcome (van Vliet et al., 2016), and deforestation is no exception. While these correlations can complicate model design and validation (van Vliet et al., 2016), they also suggest that the large range of available datasets (detailed further in this section) may provide an alternative source of variables in cases where more expensive or difficult to collect options are not available.

One major development in geo-referenced datasets is the World Database on Protected Areas (WDPA), which is maintained by the United Nations Environment Programme World Conservation Monitoring Centre (UNEP-WCMC). The positive influence of protected areas (PAs) on preventing deforestation within their boundaries has been shown (Mas, 2005, Gaveau et al., 2009), although there is some debate in the literature regarding the magnitude of this influence, with some evidence that the credit afforded to protected areas is due not to the protected status of the forest, but to other attributes, such as accessibility (Gaveau et al., 2009). The database is a global, geo-referenced dataset that details the location (as a polygon layer) and date of declaration for the world's PAs (WDPA, 2010). It also lists details such as the conservation category (if any) for each PA, as described by the International Union for Conservation of Nature – IUCN (Dudley, 2008).

The Landsat Thematic Mapper and Enhanced Thematic Mapper images provide global data with a spatial resolution of 30 m × 30 m (Wang et al., 2009), with the most recent satellite, Landsat 8, being launched in 2013 (NASA, 2015). Data from the Landsat satellite program (UT-Battelle, 2013) are frequently used in deforestation studies for calculating slope and elevation variables (Mas et al., 2004, Gaveau et al., 2009, Wang et al., 2009). Satellite data is also available from the National Aeronautics and Space Administration's (NASA's) Geocover project which has been used by Conservation International (CI) to create land use change datasets for many deforestation hotspots. The dataset for Mexico is in raster format (28.5 m resolution) and maps forest lost between 1990 and 2000, and between 2000 and 2005, where forest is defined as old growth forest, secondary and degraded forests, and plantations (Vaca et al., 2012). An equivalent dataset covering Madagascar exists from the same source (Allnutt et al., 2013).

Other non-profit organisations also make data available without charge for scientific or other non-commercial purposes. Natural Earth (NE) has published a large number of datasets with global coverage, including political boundaries and locations of populated places, ports and airports (NE, 2013b). These datasets give access to a number of variables that have been linked to deforestation risk, such as distance to populations of different sizes (Mas et al., 2004) and political boundaries, such as states and countries that may have differing forest protection policies. Similarly, the World Wildlife Fund (WWF) has produced a global map of terrestrial ecoregions (Olson et al., 2001), which has been used to identify and control for differences in deforestation rates between the different ecoregions (Vaca et al., 2012). A global dataset of major roads is also available (mapAbility, 2012). While a useful reference, this last dataset should be used with caution as the dates for when the roads were created are not given, meaning that it cannot be verified if the roads were in existence at the time when deforestation occurred.

In addition to free datasets, data can also be purchased. As an example, the Oak Ridge National Laboratory offers the Landscan population pressure raster dataset at 1 km resolution (UT-Battelle, 2013). While the cost may prove prohibitive for some studies, the data are considered high quality and the algorithms used to calculate population pressure make use of roads, populated areas (urban boundaries) and populated points (towns and villages). It has been used previously to estimate population pressure when analysing deforestation (Rogers et al., 2010). As with the road location data, values in the population pressure dataset represent the state of the world in a recent time period, rather than when deforestation was occurring. This may affect the relevance of the information, particularly if there have been significant population movements over the past few decades.

The datasets described have several advantages that make them widely applicable to deforestation studies. All provide extensive, in some cases global, coverage of deforestation hotspots while still having sufficient resolution for more local studies. Most also offer enough information to derive a selection of potentially useful predictor variables for estimating deforestation risk. The variety of the datasets provides an extensive range of potential predictor variables, including many of the most commonly studied predictors such as slope, elevation, population pressure and surrounding land use.
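To illustrate how one such predictor might be derived, the following is a minimal sketch that computes slope from a gridded elevation layer by finite differences. The function name, the toy elevation grid and the 30 m cell size are assumptions chosen for illustration; the study's actual extraction workflow is not described here.

```python
import numpy as np

def slope_degrees(elevation, cell_size=30.0):
    """Slope in degrees from a 2-D elevation grid (metres).

    cell_size is the ground distance between neighbouring cells (assumed 30 m).
    """
    # np.gradient returns the rate of change along rows, then along columns
    dz_dy, dz_dx = np.gradient(elevation, cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# Toy grid (hypothetical): a plane rising 30 m for every 30 m cell eastward
elevation = np.tile(np.arange(5) * 30.0, (4, 1))
slope = slope_degrees(elevation)

# A perfectly flat grid for comparison
flat_slope = slope_degrees(np.full((4, 5), 100.0))
```

On the tilted plane every cell has a rise equal to its run, so the slope is 45 degrees throughout; the flat grid gives zero.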

GLMs (Hastie et al., 2009) are a family of statistical techniques commonly applied in deforestation prediction. In a GLM the output is modelled as a linear combination of the inputs, sometimes passed through a nonlinear function (e.g. a sigmoidal function for logistic regression). GLMMs are an extension of GLMs that can model random effects among groups of predictors and are becoming widely used in environmental and conservation analysis studies (Green et al., 2013, Newman et al., 2014). Both GLMs and GLMMs are unable to account for interactions between predictors unless these are explicitly modelled and pre-specified.
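The logistic regression case can be sketched as follows: a linear combination of the inputs is passed through a sigmoidal function to give a probability of deforestation. This is an illustrative sketch on synthetic data, fitted by plain gradient descent; the predictor names, the generating coefficients and the fitting routine are assumptions and do not reproduce the models or data used in the study.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit logistic regression weights by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)  # gradient of the mean negative log-likelihood
    return w

def predict_proba(w, X):
    """Predicted probability of deforestation for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return sigmoid(Xb @ w)

# Synthetic, standardised predictors (hypothetical): slope and distance to road
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([-0.5, -1.0, -2.0])  # assumed: steeper, more remote cells are at lower risk
y = (rng.random(500) < sigmoid(true_w[0] + X @ true_w[1:])).astype(float)

w_hat = fit_logistic(X, y)
risk = predict_proba(w_hat, X)  # one risk value per cell
```

Because the model is linear in its inputs, an interaction such as slope × distance would have to be added as an extra column of X by hand, which is the limitation noted above.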

ANNs are predictive models loosely based on the biological structure of the human brain (Rumelhart et al., 1986; or more recently Haykin, 2009). An ANN is constructed by linking input nodes, with weighted connections, to output nodes via one or more layers of hidden nodes. Without these hidden layers an ANN is equivalent to either linear or logistic regression, depending on the output function used. During training, data are presented to the network via the input layer. These values are then processed by the first layer of hidden nodes, where they are multiplied by the weights for each node and processed according to a sigmoidal activation function. The output from each layer of hidden nodes is used as the input to the next layer of hidden nodes (Haykin, 2009). The output from the final hidden layer is passed to the output nodes.
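The forward pass described above can be sketched in a few lines. The layer sizes, the random weights and the number of predictors are arbitrary assumptions for illustration, and training (adjusting the weights to reduce prediction error) is omitted.

```python
import numpy as np

def sigmoid(z):
    """Sigmoidal activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, hidden_layers, output_layer):
    """Pass inputs through each hidden layer in turn, then the output layer.

    Each layer is a (weights, bias) pair.
    """
    a = x
    for W, b in hidden_layers:
        a = sigmoid(a @ W + b)  # weighted sum followed by activation
    W_out, b_out = output_layer
    return sigmoid(a @ W_out + b_out)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))  # 4 cells, 3 hypothetical predictors

# Two hidden layers (5 and 2 nodes) feeding one output node
hidden = [(rng.normal(size=(3, 5)), np.zeros(5)),
          (rng.normal(size=(5, 2)), np.zeros(2))]
out = (rng.normal(size=(2, 1)), np.zeros(1))
p_ann = forward(x, hidden, out)

# With no hidden layers and a sigmoidal output, the network is logistic regression
W_lr, b_lr = rng.normal(size=(3, 1)), np.zeros(1)
p_lr = forward(x, [], (W_lr, b_lr))
```

The last two lines make the equivalence concrete: `p_lr` is exactly the sigmoid of a linear combination of the inputs.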

A BN is a graphical model that takes a probabilistic approach to representing relationships among variables (Fenton and Neil, 2013). At the core of this approach is Bayes' theorem, which uses conditional probabilities to estimate the probability of a hypothesis given the evidence. Key benefits of BNs are their ability to deal with uncertain or missing data and a clear, graphical representation of the relationships between variables (Uusitalo, 2007). Another advantage of BNs is that both the network structure (cause and effect relationships among variables) and conditional probabilities can be either learnt from data or derived from expert knowledge.
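A toy example may help make this concrete. The sketch below defines a three-node network in which proximity to a road and protected status both influence deforestation, then answers queries by enumerating the joint distribution. The structure and every probability are invented for illustration and are not taken from the study.

```python
from itertools import product

# Hypothetical probabilities, for illustration only
p_road = {1: 0.4, 0: 0.6}   # P(cell is near a road)
p_prot = {1: 0.2, 0: 0.8}   # P(cell is in a protected area)
# P(deforested = 1 | road, prot)
p_defor = {(1, 0): 0.30, (1, 1): 0.10, (0, 0): 0.05, (0, 1): 0.01}

def joint(road, prot, defor):
    """Joint probability of one full assignment, using the network factorisation."""
    p_d = p_defor[(road, prot)]
    return p_road[road] * p_prot[prot] * (p_d if defor == 1 else 1.0 - p_d)

def query(target, evidence):
    """P(target | evidence) by enumeration; unobserved variables are summed out."""
    names = ("road", "prot", "defor")
    num = den = 0.0
    for values in product((0, 1), repeat=3):
        state = dict(zip(names, values))
        if any(state[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(**state)
        den += p
        if all(state[k] == v for k, v in target.items()):
            num += p
    return num / den

# Protected status is missing, so it is marginalised rather than imputed
p_d_given_road = query({"defor": 1}, {"road": 1})
# Reasoning from effect back to cause, as in Bayes' theorem
p_road_given_d = query({"road": 1}, {"defor": 1})
```

The first query shows how a BN handles a missing input: the unobserved variable is averaged over according to its probability instead of the record being discarded.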

GPs are a spatial model that can be viewed as a particular instance of the well-established geostatistics technique of kriging (Hastie et al., 2009) and allow for a very flexible range of response functions to be modelled. The model is defined by a mean function and a covariance function, which together define a prior distribution over the possible functions. Given a training set, this prior can be converted into a posterior distribution using Bayes' rule. The posterior distribution is then used to make predictions, e.g. by taking the mean of the posterior distribution as the predicted value of the response function at a test point. A detailed explanation of the equations used in GPs is given in Section 1 of the online supplementary material.
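The prior-to-posterior step can be sketched with the standard GP regression equations. This sketch assumes a zero mean function, a squared-exponential covariance function, Gaussian noise and a continuous response; these are illustrative choices rather than the configuration used in the study, and predicting a binary outcome such as presence or absence of deforestation would need a non-Gaussian likelihood and an approximation step not shown here.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)  # stable alternative to inverting K directly
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

# Toy one-dimensional data (hypothetical): samples of a smooth function
X_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(X_train).ravel()

# One test point inside the sampled range, one far outside it
X_test = np.array([[2.5], [10.0]])
mean, var = gp_posterior(X_train, y_train, X_test)
```

Inside the sampled range the posterior mean tracks the data and the variance is small; far from any training point the prediction reverts towards the prior mean of zero and the variance grows, which gives a built-in measure of uncertainty.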

ANNs have shown promising results when applied to deforestation risk modelling (Mas et al., 2004) and more generally to modelling changes in land use (Basse et al., 2014). BNs have been successfully applied to numerous environmental management studies including reforestation (Frayer et al., 2014) and forest dynamics (Liedloff and Smith, 2010). While no research was found that specifically used GPs in deforestation risk assessment, the method is similar to kriging (Rasmussen and Williams, 2006), making it a suitable approach for modelling spatial patterns (Campos-Taberner et al., 2015, Yan et al., 2016).

Materials and methods

Two areas were selected for this study, one in Mexico (Fig. 1) and the other in Madagascar (Fig. 2). Mexico is widely recognised as one of a handful of the remaining mega-diverse countries for biodiversity and has historically had high deforestation rates (Mas, 2005). Changes in government policies and investment in infrastructure in the late 20th century played a major role in increasing forest loss (Ellis and Porter-Bolland, 2008).

Within the Yucatan region there are variations in the types

Results

Results presented in this section cover both the usefulness of the datasets and a comparison of the statistical and machine learning methods tested for predicting the probability of deforestation.

Discussion and conclusion

Although obtaining datasets for deforestation studies can be challenging, the range and quality of freely available data sources is increasing. The datasets obtained for this study were sufficient to produce reasonable predictions of deforestation risk using even basic statistical models (GLMs and GLMMs), with AUC values generally above 0.8, which is considered good (Platts et al., 2008). The freely available data allowed for a variety of different predictors to be used in modelling

Acknowledgements

We would like to thank Conservation International for making their dataset available to us for this study, as well as the deforestation experts who assisted with the expert derived BNs. We would also like to thank Dr Toby Marthews for his assistance with the statistical models and Dr David Pullar for his early assistance in extracting the predictor variables from the spatial datasets. The input of Dr Lauren Coad in the design of the study is also gratefully acknowledged. Finally we are grateful

References (63)

  • A.C. Liedloff et al. Predicting a ‘tree change’ in Australia's tropical savannas: combining different types of models to understand complex ecosystem behaviour. Ecol. Model. (2010)
  • J.F. Mas et al. Modelling deforestation using GIS and artificial neural networks. Environ. Model. Softw. (2004)
  • W.J. McConnell et al. Physical and social access to land: spatio-temporal patterns of agricultural expansion in Madagascar. Agric. Ecosyst. Environ. (2004)
  • R. Müller et al. Spatiotemporal modeling of the expansion of mechanized agriculture in the Bolivian lowland forests. Appl. Geogr. (2011)
  • M.E. Newman et al. Assessing deforestation and fragmentation in a tropical moist forest over 68 years; the impact of roads and legal protection in the Cockpit Country, Jamaica. For. Ecol. Manag. (2014)
  • A. Pérez-Vega et al. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. (2012)
  • B.T. Pham et al. A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area (India). Environ. Model. Softw. (2016)
  • P.J. Platts et al. Predicting tree distributions in an East African biodiversity hotspot: model selection, data bias and envelope uncertainty. Ecol. Model. (2008)
  • B.E. Robinson et al. Does secure land tenure save forests? A meta-analysis of the relationship between land tenure and tropical deforestation. Glob. Environ. Change (2014)
  • H.M. Rogers et al. Prioritizing key biodiversity areas in Madagascar by including data on human pressure and ecosystem services. Landsc. Urban Plan. (2010)
  • C.O. Tan et al. Predictive models in ecology: comparison of performances and assessment of applicability. Ecol. Inf. (2006)
  • A. Tayyebi et al. Comparing three global parametric and local non-parametric models to simulate land use change in diverse areas of the world. Environ. Model. Softw. (2014)
  • L. Uusitalo. Advantages and challenges of Bayesian networks in environmental modelling. Ecol. Model. (2007)
  • J. van Vliet et al. A review of current calibration and validation practices in land-change modeling. Environ. Model. Softw. (2016)
  • G. Wang et al. Mapping and spatial uncertainty analysis of forest vegetation carbon by combining national forest inventory data and satellite images. For. Ecol. Manag. (2009)
  • T.F. Allnutt et al. Mapping recent deforestation and forest disturbance in northeastern Madagascar. Trop. Conservation Sci. (2013)
  • O. Allouche et al. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). J. Appl. Ecol. (2006)
  • C.J.A. Bradshaw et al. Global evidence that deforestation amplifies flood risk and severity in the developing world. Glob. Change Biol. (2007)
  • V. Calcagno et al. glmulti: an R package for easy automated model selection with (generalized) linear models. J. Stat. Softw. (2010)
  • M. Campos-Taberner et al. Mapping leaf area index with a smartphone and Gaussian processes. IEEE Geosci. Remote Sens. Lett. (2015)
  • N.V. Chawla et al. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)