Bagging GLM: Improved generalized linear model for the analysis of zero-inflated data
Research highlights
► This paper proposes a new species distribution modelling method that overcomes the problems caused by zero-inflated data sets.
► We combined a regression model with a machine-learning technique to create the new method.
► We compared the performance of the new method with that of a traditional method using several zero-inflated data sets.
► The new method showed high predictive ability for presence data when analyzing zero-inflated data sets.
Introduction
Predictive species distribution models play an important role in ecology and biogeography (Guisan and Thuiller, 2005, Guisan and Zimmermann, 2000, Marmion et al., 2009) and are increasingly used in a range of applications, including regional biodiversity assessments, conservation biology, wildlife management, and conservation planning (Elith et al., 2006, Elith and Leathwick, 2007, Rodriguez et al., 2007). Effective use of species distribution models requires not only a large quantity of high-quality data on the target species, but also complex and robust statistical methodologies (Elith et al., 2006).
Statistical approaches are employed to model species distributions, using relevant predictors to estimate current abundances (Prasad et al., 2006). Currently, regression-like techniques are among the most commonly used techniques in ecology and biogeography (Elith et al., 2006, Guisan et al., 1999, Guisan and Zimmermann, 2000, Margules and Sarkar, 2007, Marmion et al., 2009, Randin et al., 2006). Models of this type can be graphed and assessed for ecological rationality, which makes the modeled relationships transparent and open to scrutiny (Elith et al., 2005), and they are easy to use (Olden et al., 2008). At the same time, newer computer-intensive data-mining approaches using machine-learning techniques based on recursion, re-sampling, averaging, and randomization have become more common, and supporting tools have been developed (Elith et al., 2008, Olden et al., 2008, Prasad et al., 2006).
Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (Martin et al., 2005, Quintero et al., 2007). Such data sets are referred to as zero-inflated (Fletcher et al., 2005, Martin et al., 2005). Zero inflation causes over-dispersion, which means that the actual variance of the observations exceeds the nominal variance of the postulated distribution (Hinde and Demetrio, 2002, Poortema, 1999). In turn, over-dispersion violates basic model assumptions and thereby undermines sound statistical inference (Martin et al., 2005). This problem decreases the predictive performance of a model. Many researchers have considered this issue, and several approaches have been proposed to solve it (Fletcher et al., 2005, Fukasawa et al., 2009, Hinde and Demetrio, 2002, Martin et al., 2005, Poortema, 1999). A suitable technique must therefore be selected when analyzing a zero-inflated data set. However, Martin et al. (2005) indicated that when considering how to model zero-inflated data sets, it is important to take into account the data properties, i.e., whether the data set contains a large number of true zero values, false zeroes, or both. As a result, these methods tend to be situation-dependent, which makes selecting an analysis method difficult. A robust and easy-to-use technique for modeling zero-inflated data sets is much needed.
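The link between zero inflation and over-dispersion can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes count data from a zero-inflated Poisson distribution, and the rate `lam`, the excess-zero probability `pi_zero`, and the helper names are hypothetical choices made for illustration. For a plain Poisson model the nominal variance equals the mean, so a variance-to-mean ratio well above 1 indicates over-dispersion.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_zero_inflated(n, lam, pi_zero, rng):
    """Draw n counts: an excess zero with probability pi_zero, else Poisson(lam)."""
    return [0 if rng.random() < pi_zero else sample_poisson(lam, rng)
            for _ in range(n)]


rng = random.Random(0)
lam, pi_zero = 3.0, 0.7          # hypothetical values, not from the paper
counts = sample_zero_inflated(20000, lam, pi_zero, rng)

mean = statistics.fmean(counts)
var = statistics.pvariance(counts)
dispersion = var / mean          # equals 1 in expectation for a pure Poisson

# Closed-form moments of the zero-inflated Poisson, for comparison.
theory_mean = (1.0 - pi_zero) * lam
theory_var = theory_mean * (1.0 + pi_zero * lam)

print(f"sample mean {mean:.3f} (theory {theory_mean:.3f})")
print(f"sample variance {var:.3f} (theory {theory_var:.3f})")
print(f"variance / mean = {dispersion:.2f}")
```

Under these assumptions the theoretical variance-to-mean ratio is 1 + pi_zero * lam, so the excess zeros alone push the variance above what a Poisson model would assume.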
Here, we propose a new, simple approach to treating zero-inflated data that is based on both a regression model and a model-aggregation idea that uses bootstrapping to combine a large number of models (Breiman, 1996, Sutton, 2005). Zero inflation arises because presence and absence (zero) rates are not equal. We established a large number of regression models, each using the same composition of presence/absence data drawn from the zero-inflated data, and then combined the models. More specifically, we used all of the presence/absence data, but weighted the presence data using machine-learning techniques (see Materials and methods section). In short, we used a model ensemble framework with machine-learning techniques (Araújo and New, 2007, Capinha and Anastácio, 2011). Although the model ensemble technique is not a singular solution for analyzing zero-inflated data, it has the potential to solve such problems, as ensemble modeling may overcome unusual problems (Araújo and New, 2007, Capinha and Anastácio, 2011, Marmion et al., 2009). Thus, this new method may allow us to mitigate problems caused by zero-inflated data and may provide good predictions for zero-inflated data sets.
In this study, we propose a model aggregation method that combines a generalized linear model (GLM) and bootstrap aggregation (bagging; hereafter referred to as "Bagging GLM") to predict species distributions and treat zero-inflated data. We established distribution models for Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl) in Hyogo Prefecture, Japan, using both the Bagging GLM and a normal GLM, and compared their predictive performance. Both V. pycnostelma and N. scutulata are listed as endangered species in several regions in Japan, and their occurrence data sets are often zero-inflated. Additionally, we generated four theoretical zero-inflated data sets that contain different ratios of presence/absence data and applied both methods to them. We discuss the effectiveness and potential of the new method on the basis of our results.
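The general idea of combining a GLM with bagging can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation, whose details are in the Materials and methods section and are not reproduced here. It assumes a binomial GLM with a logit link and a single predictor, bootstrap samples that each draw as many absences as presences, and aggregation by averaging coefficients. The simulated data, the number of models, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, n_iter=25):
    """Fit a binomial GLM (logit link, one predictor) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:              # near-singular information matrix: stop
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def bagging_glm(xs, ys, n_models=100, seed=0):
    """Average GLM coefficients over balanced bootstrap samples (assumed scheme)."""
    rng = random.Random(seed)
    presences = [i for i, y in enumerate(ys) if y == 1]
    absences = [i for i, y in enumerate(ys) if y == 0]
    k = len(presences)
    fits = []
    for _ in range(n_models):
        # Resample presences and an equal number of absences, with replacement.
        idx = rng.choices(presences, k=k) + rng.choices(absences, k=k)
        fits.append(fit_logistic([xs[i] for i in idx], [ys[i] for i in idx]))
    return (sum(f[0] for f in fits) / n_models,
            sum(f[1] for f in fits) / n_models)


def sensitivity(xs, ys, coef, threshold=0.5):
    """Fraction of true presences predicted as present at the given threshold."""
    b0, b1 = coef
    hits = sum(1 for x, y in zip(xs, ys)
               if y == 1 and sigmoid(b0 + b1 * x) >= threshold)
    return hits / sum(ys)


# Hypothetical zero-inflated presence/absence data with one predictor.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1 if data_rng.random() < sigmoid(-3.0 + 1.5 * x) else 0 for x in xs]

glm_coef = fit_logistic(xs, ys)
bag_coef = bagging_glm(xs, ys)
print("normal GLM  :", glm_coef, "sensitivity", sensitivity(xs, ys, glm_coef))
print("bagged GLMs :", bag_coef, "sensitivity", sensitivity(xs, ys, bag_coef))
```

In this toy setup, balancing presences and absences in each bootstrap sample mainly raises the fitted intercept, so more sites exceed a fixed 0.5 threshold and sensitivity increases relative to the single GLM fitted to all data. Averaging predicted probabilities instead of coefficients would be an equally plausible aggregation rule.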
Section snippets
V. pycnostelma (vascular plant) models
We used two different extents, each with its own cell size, for V. pycnostelma distributions, as different spatial scales may produce different relationships between species occurrence and environmental factors (Hobbs, 2003). The broad scale included all of Hyogo Prefecture, Japan, approximately 100,000 × 200,000 m with a 1000-m cell size (Fig. 1); the other was a detailed scale that covered one rural area in Hyogo Prefecture, approximately 2000 × 7000 m with a 10-m cell size (Fig. 1). In the broad scale,
V. pycnostelma (vascular plant) models
At the broad scale, AUC values for both Bagging GLMs and the normal GLM were almost the same, at approximately 0.69 (Table 1). However, sensitivity and specificity values derived using Bagging GLMs clearly differed from those of the normal GLM. Sensitivity values for Bagging GLMs were all 86.7, whereas the normal GLM value was 77.8. Specificity values for Bagging GLMs ranged from 56.6 to 56.7, whereas that for the normal GLM was 64.7 (Table 1). The estimated coefficient values of the
Discussion
AUC values for Bagging GLMs and the normal GLM were almost the same, but the component measures (sensitivity and specificity) differed. Bagging GLMs tended to have high sensitivity values, whereas the normal GLM tended to have high specificity values in zero-inflated data sets, including artificial data sets. Below, we discuss the reasons for these results and the advantages of using the Bagging GLM.
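For readers unfamiliar with these measures, the following sketch shows how sensitivity, specificity, and AUC are commonly computed from predicted probabilities. It is an illustration only: the ten labels and scores are made up, the function names are hypothetical, and the values are returned as fractions rather than percentages. It also shows why two models can share an AUC while differing in sensitivity and specificity, because AUC depends only on how the scores rank presences against absences, whereas the other two depend on where the scores fall relative to the classification threshold.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) for predictions cut at threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """AUC as the share of presence/absence pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 3 presences and 7 absences.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.3, 0.7, 0.4, 0.4, 0.2, 0.2, 0.1, 0.1]

strict = sensitivity_specificity(y_true, y_score, threshold=0.5)
lenient = sensitivity_specificity(y_true, y_score, threshold=0.25)
area = auc(y_true, y_score)

print("threshold 0.50: sensitivity %.3f, specificity %.3f" % strict)
print("threshold 0.25: sensitivity %.3f, specificity %.3f" % lenient)
print("AUC: %.3f" % area)
```

Lowering the threshold here raises sensitivity and lowers specificity while the AUC is unchanged. Shifting every predicted probability upward while keeping the threshold fixed would have the same effect.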
In distribution models using real data sets (i.e., the V. pycnostelma and N. scutulata models), Bagging GLMs showed high
Acknowledgements
We thank the staff of the Ecology Division of the Museum of Nature and Human Activities, Hyogo, for their valuable support, and the staff of the Laboratory of Biodiversity at Kobe University. We also thank Dr. K. Fukasawa for commenting on an earlier version of this manuscript.
References (39)
- et al. The evaluation strip: a new and robust method for plotting predicted responses from species distribution models. Ecol. Mod. (2005)
- et al. New developments in museum-based informatics and applications in biodiversity analysis. Trends Ecol. Evol. (2004)
- et al. Predictive habitat distribution models in ecology. Ecol. Mod. (2000)
- Challenges and opportunities in integrating ecological knowledge across scales. Forest Ecol. Manag. (2003)
- Categorical Data Analysis (2002)
- Araújo B, New M (2007) Ensemble forecasting of species distributions. Trends Ecol. Evol....
- et al. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning (1999)
- Bagging predictors. Machine Learning (1996)
- et al. Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography (2004)
- Cameron S.A., Lozier J.D., Strange J.P., Koch J.B., Cordes N., Solter L.F., and Griswold T.L. (2011) Patterns of...
- Assessing the environmental requirements of invaders using ensembles of distribution models. Diversity Distrib.
- An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning
- Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines. Diversity Distrib.
- A working guide to boosted regression trees. J. Anim. Ecol.
- Modelling skewed data with many zeros: a simple approach combining ordinary and logistic regression. Environ. Ecol. Stat.
- Predicting future invasion of an invasive alien tree in a Japanese oceanic island by process-based statistical models using recent distribution maps. Ecol. Res.