Bagging GLM: Improved generalized linear model for the analysis of zero-inflated data

doi:10.1016/j.ecoinf.2011.05.003

Ecological Informatics

Volume 6, Issue 5, September 2011, Pages 270-275

https://doi.org/10.1016/j.ecoinf.2011.05.003

Abstract

Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (zero-inflated). Statistical inference using such data sets is likely to be inefficient or lead to incorrect conclusions unless the data are treated carefully. In this study, we propose a new modeling method that combines a regression model and a machine-learning technique to overcome the problems caused by zero-inflated data sets. We combined a generalized linear model (GLM), which is widely used in ecology, with bootstrap aggregation (bagging), a machine-learning technique. We established distribution models of Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl), both of which are endangered and have zero-inflated distribution patterns, using our new method and the traditional GLM, and compared model performances. We also modeled four theoretical data sets that contained different ratios of presence/absence values using the new and traditional methods and compared model performances. For the distribution models, our new method performed well compared to the traditional GLM. After bagging, area under the curve (AUC) values were almost the same as with the traditional method, but sensitivity values were higher. Additionally, our new method showed higher sensitivity values than the traditional GLM when modeling a theoretical data set containing a large proportion of zero values. These results indicate that our new method has high predictive ability for presence data when analyzing zero-inflated data sets. Generally, predicting presence data is more difficult than predicting absence data. Our new modeling method has potential for advancing species distribution modeling.

Research highlights

► This paper proposes a new modelling method for species distribution that overcomes the problems caused by zero-inflated data sets.
► We combined a regression model and a machine-learning technique to form the new method.
► We compared the performance of the new method and the traditional method using several zero-inflated data sets.
► Our results showed that the new method has high predictive ability for presence data when analyzing zero-inflated data sets.

Introduction

Predictive species distribution models play an important role in ecology and biogeography (Guisan and Thuiller, 2005, Guisan and Zimmermann, 2000, Marmion et al., 2009) and are being increasingly used in a range of applications, including regional biodiversity assessments, conservation biology, wildlife management, and conservation planning (Elith et al., 2006, Elith and Leathwick, 2007, Rodriguez et al., 2007). Significant use of species distribution models requires not only a high quantity of quality data on the target species, but also complex and robust statistical methodologies (Elith et al., 2006).

Statistical approaches are employed to model species distributions, using relevant predictors to estimate current abundances (Prasad et al., 2006). Currently, regression-like techniques are among the most commonly used techniques in ecology and biogeography (Elith et al., 2006, Guisan et al., 1999, Guisan and Zimmermann, 2000, Margules and Sarkar, 2007, Marmion et al., 2009, Randin et al., 2006). The relationships fitted by these techniques can be graphed and assessed for ecological rationality, which makes them transparent and open to scrutiny (Elith et al., 2005), and the techniques are easy to use (Olden et al., 2008). At the same time, newer computer-intensive data-mining approaches using machine-learning techniques based on recursion, re-sampling, averaging, and randomization have become more common, and tools for them have been developed (Elith et al., 2008, Olden et al., 2008, Prasad et al., 2006).

Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (Martin et al., 2005, Quintero et al., 2007). Such data sets are referred to as zero-inflated (Fletcher et al., 2005, Martin et al., 2005). Zero inflation causes over-dispersion, which means the actual variance of the observations exceeds the nominal variance of the postulated distribution (Hinde and Demetrio, 2002, Poortema, 1999). Subsequently, over-dispersion causes problems when making sound statistical inference by violating basic assumptions (Martin et al., 2005). This problem decreases the predictive performance of a model. Many researchers have considered this, and several approaches have been produced to solve it (Fletcher et al., 2005, Fukasawa et al., 2009, Hinde and Demetrio, 2002, Martin et al., 2005, Poortema, 1999). Suitable techniques for analyzing the zero-inflated data set must be selected. However, Martin et al. (2005) indicated that when considering how to model zero-inflated data sets, it is important to take into account the data properties, i.e., whether the data set contains a large number of true zero values, false zeroes, or both. As a result, these methods tend to be situation-dependent approaches. Due to these problems, selecting analysis methods can be difficult. A robust and easy-to-use technique to model zero-inflated data sets is much needed.
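To make the over-dispersion statement concrete, the following minimal sketch uses a zero-inflated Poisson count as an illustrative assumption; the text above does not tie the argument to any particular distribution. With a fraction pi of excess zeros and a Poisson mean lam (both names are hypothetical), the mean is (1 - pi) * lam and the variance is (1 - pi) * lam * (1 + pi * lam), so the variance exceeds the nominal Poisson variance (equal to the mean) whenever pi > 0.

```python
def zip_mean_variance(pi, lam):
    """Mean and variance of a zero-inflated Poisson variable.

    pi  : probability of an excess zero, 0 <= pi < 1 (illustrative name)
    lam : mean of the Poisson component, lam > 0 (illustrative name)
    """
    mean = (1.0 - pi) * lam
    variance = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, variance


def dispersion_ratio(pi, lam):
    """Variance divided by mean; a plain Poisson model assumes this is 1."""
    mean, variance = zip_mean_variance(pi, lam)
    return variance / mean


if __name__ == "__main__":
    # The ratio grows with the share of excess zeros.
    for pi in (0.0, 0.5, 0.9):
        print(pi, dispersion_ratio(pi, lam=2.0))
```

Under this assumption the ratio equals 1 + pi * lam, so a data set in which most records are excess zeros departs further from the nominal variance than one with few.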

Here, we propose a new, simple approach to treating zero-inflated data that is based on both a regression model and a model aggregation idea that uses bootstrapping to combine a large number of models (Breiman, 1996, Sutton, 2005). The main source of zero inflation is that presence and absence (zero) rates are not equal. We established a large number of regression models using the same composition of presence/absence data from zero-inflated data and then combined the models. More specifically, we used all of the presence/absence data, but weighted the presence data using machine-learning techniques (see Materials and methods section). In short, we used a model ensemble framework with machine-learning techniques (Araújo and New, 2007, Capinha and Anastácio, 2011). Although the model ensemble technique is not a singular solution for analyzing zero-inflated data, it has the potential to solve problems, as ensemble modeling may overcome unusual problems (Araújo and New, 2007, Capinha and Anastácio, 2011, Marmion et al., 2009). Thus, this new method may allow us to mitigate problems caused by zero-inflated data and may provide good prediction for zero-inflated data sets.
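The general idea can be sketched as follows. This is a minimal illustration and not the authors' implementation: the exact resampling scheme is given in the Materials and methods section, which is not reproduced here. The sketch assumes that each replicate keeps every presence record and draws an equal number of absence records with replacement, that each replicate is a logistic (binomial, logit-link) GLM, and that the models are combined by averaging predicted probabilities. All function names and the demo data are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(beta, row):
    """Predicted presence probability for one site (row of predictors)."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))


def fit_logistic_glm(X, y, steps=500, lr=1.0):
    """Fit a binomial GLM with logit link by plain gradient ascent.

    A stand-in for a library GLM fitter; returns [intercept, coef_1, ...].
    """
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, target in zip(X, y):
            err = target - predict(beta, row)
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


def bagging_glm(X, y, n_models=25, seed=0):
    """Fit n_models GLMs, each on a balanced resample of the data.

    Assumption: every replicate keeps all presences (y == 1) and draws the
    same number of absences (y == 0) at random, with replacement.
    """
    rng = random.Random(seed)
    presences = [i for i, t in enumerate(y) if t == 1]
    absences = [i for i, t in enumerate(y) if t == 0]
    models = []
    for _ in range(n_models):
        idx = presences + [rng.choice(absences) for _ in presences]
        models.append(fit_logistic_glm([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagged_predict(models, row):
    """Assumption: combine the models by averaging predicted probabilities."""
    return sum(predict(m, row) for m in models) / len(models)


def make_demo_data():
    """Hypothetical zero-inflated data: 90 absences, 10 presences, 1 predictor."""
    X = [[i / 100.0] for i in range(90)] + [[0.70 + 0.03 * k] for k in range(10)]
    y = [0] * 90 + [1] * 10
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    single = fit_logistic_glm(X, y)
    models = bagging_glm(X, y)
    print("single GLM  p(x=0.9):", round(predict(single, [0.9]), 3))
    print("bagging GLM p(x=0.9):", round(bagged_predict(models, [0.9]), 3))
```

Because each replicate sees presences and absences in equal numbers, the aggregated model assigns higher presence probabilities at sites resembling known presences than a single GLM fitted to the full zero-inflated data, which is the mechanism one would expect to raise sensitivity.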

In this study, we propose a model aggregation method that combines a generalized linear model (GLM) and bootstrap aggregation (bagging; hereafter referred to as "Bagging GLM") to predict species distributions and treat zero-inflated data. We established distribution models for Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl) in Hyogo Prefecture, Japan, using both the Bagging GLM and the normal GLM, and compared their predictive performance. Both V. pycnostelma and N. scutulata are listed as endangered species in several regions in Japan, and their occurrence data sets are often zero-inflated. Additionally, we generated four theoretical zero-inflated data sets that contain different ratios of presence/absence data and applied both methods to them. We discuss the effectiveness and potential of the new method on the basis of our results.

Section snippets

V. pycnostelma (vascular plant) models

We modeled V. pycnostelma distributions at two spatial extents, each with its own cell size, because different spatial scales may produce different relationships between species occurrence and environmental factors (Hobbs, 2003). The broad scale included all of Hyogo Prefecture, Japan, approximately 100,000 × 200,000 m with a 1000-m cell size (Fig. 1); the detailed scale covered one rural area in Hyogo Prefecture, approximately 2000 × 7000 m with a 10-m cell size (Fig. 1). In the broad scale,

V. pycnostelma (vascular plant) models

At the broad scale, AUC values for both Bagging GLMs and the normal GLM were almost the same at approximately 0.69 (Table 1). However, sensitivity and specificity values derived using Bagging GLMs clearly differed from those of the normal GLM. Sensitivity values for Bagging GLMs were all 86.7, whereas the normal GLM value was 77.8. Specificity values for Bagging GLMs ranged from 56.6 to 56.7, whereas that for the normal GLM was 64.7 (Table 1). The estimated coefficient values of the

Discussion

AUC values for Bagging GLMs and the normal GLM were almost the same, but their component measures, sensitivity and specificity, differed. Bagging GLMs tended to have high sensitivity values, whereas the normal GLM tended to have high specificity values in zero-inflated data sets, including artificial data sets. Below, we discuss the reasons for these results and the advantages of using the Bagging GLM.
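As a point of reference for the measures compared here, the following minimal sketch computes sensitivity, specificity, and AUC from predicted probabilities. The 0.5 classification threshold, the function names, and the toy scores are assumptions made for illustration; they are not taken from the study. The example shows how two sets of predictions can share the same AUC, which depends only on the ranking of sites, while differing in sensitivity and specificity, which depend on where the predictions fall relative to the threshold.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Return (sensitivity, specificity) as fractions between 0 and 1.

    Sensitivity is the share of presences predicted as present;
    specificity is the share of absences predicted as absent.
    """
    tp = fn = tn = fp = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """Probability that a random presence outscores a random absence.

    Ties count as one half (the rank-based definition of AUC).
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy zero-inflated labels: 2 presences, 6 absences.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    # Same ranking of sites, but the second set is shifted upward.
    low_scores = [0.45, 0.30, 0.40, 0.20, 0.10, 0.10, 0.05, 0.05]
    high_scores = [0.65, 0.50, 0.60, 0.40, 0.30, 0.30, 0.25, 0.25]
    for name, scores in (("low", low_scores), ("high", high_scores)):
        print(name, auc(y_true, scores), sensitivity_specificity(y_true, scores))
```

In this toy case both score sets give the same AUC, but the upward-shifted scores trade some specificity for higher sensitivity, mirroring the pattern described above.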

In distribution models using real data sets (i.e., the V. pycnostelma and N. scutulata models), Bagging GLMs showed high

Acknowledgements

We thank the staff of the Ecology Division of the Museum of Nature and Human Activities, Hyogo, for their valuable support, and the staff of the Laboratory of Biodiversity in Kobe University. We also thank Dr. K. Fukasawa for commenting on an earlier version of this manuscript.

References (39)

J. Elith et al., The evaluation strip: a new and robust method for plotting predicted responses from species distribution models, Ecol. Mod. (2005)
C.H. Graham et al., New developments in museum-based informatics and applications in biodiversity analysis, Trends Ecol. Evol. (2004)
A. Guisan et al., Predictive habitat distribution models in ecology, Ecol. Mod. (2000)
N.T. Hobbs, Challenges and opportunities in integrating ecological knowledge across scales, Forest Ecol. Manag. (2003)
A. Agresti, Categorical Data Analysis (2002)
Araújo B, New M (2007) Ensemble forecasting of species distributions. Trends Ecol. Evol....
E. Bauer et al., An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning (1999)
L. Breiman, Bagging predictors, Machine Learning (1996)
L. Brotons et al., Presence–absence versus presence-only modelling methods for predicting bird habitat suitability, Ecography (2004)
Cameron S.A., Lozier J.D., Strange J.P., Koch J.B., Cordes N., Solter L.F., and Griswold T.L. (2011) Patterns of...
C. Capinha et al., Assessing the environmental requirements of invaders using ensembles of distribution models, Diversity Distrib. (2011)
T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning (2000)
Elith J, Graham CH, Anderson RP, Dudik M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A...
J. Elith et al., Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines, Diversity Distrib. (2007)
J. Elith et al., A working guide to boosted regression trees, J. Anim. Ecol. (2008)
Ezaki Y (2005) Bird Distribution and its Historical Change in Hyogo Prefecture Museum of Nature and Human Activities,...
D. Fletcher et al., Modelling skewed data with many zeros: a simple approach combining ordinary and logistic regression, Environ. Ecol. Stat. (2005)
K. Fukasawa et al., Predicting future invasion of an invasive alien tree in a Japanese oceanic island by process-based statistical models using recent distribution maps, Ecol. Res. (2009)
Fukuoka N, Kurosaki N, Takahashi A (2005) Vascular Plants of Hyogo Prefecture 6 Humans and Nature 15 (In...

