
Detecting the impact area of BP Deepwater Horizon oil discharge: an analysis by time-varying coefficient logistic models and boosted trees

  • Original Paper
  • Published in Computational Statistics

Abstract

The Deepwater Horizon oil discharge in the Gulf of Mexico is considered one of the worst environmental disasters to date. The spread of the oil spill and its consequences had various environmental impacts. The National Oceanic and Atmospheric Administration (NOAA), in conjunction with the Environmental Protection Agency (EPA), the US Fish and Wildlife Service, and the American Statistical Association (ASA), has made available several datasets containing information about the oil spill. In this paper, we analyzed four of these datasets to explore the use of applied statistics and machine learning methods for understanding the spread of the oil spill. In particular, we analyzed the “gliders, floats, boats” and “birds” data. The former contains various measurements on sea water, such as salinity, temperature, spatial location, depth and time. The latter contains information on the living conditions of birds, such as living status, oiling condition, location and time. A varying-coefficient logistic regression was fitted to the birds data. The results indicated that the oil was spreading more quickly along the East–West direction. Analyses via boosted trees and logistic regression showed similar results based on the information provided by the above data.



Author information

Correspondence to Tianxi Li.

Appendices

1.1 Model selection by predictive power

1.1.1 Model selection in birds analysis

We selected the final model by comparing the predictive power of the candidate models, because prediction was what we mainly relied on in the analysis. In the logistic regression model, we calculated the predicted probability at each validation (test) point \(x\) selected in Sect. 3.4, denoted as

$$\begin{aligned} \hat{p}(x) = \hat{P}_x(y=1) \end{aligned}$$

for all \(x\) in the validation set. To evaluate predictive power, we used 10-fold cross-validation. In the \(k\)th iteration, let the 10% validation set be \(T_{(k)}\). From the predicted probabilities, the log likelihood of \(T_{(k)}\) in each iteration is

$$\begin{aligned} \log (\hat{P}(T_{(k)})) = \sum \limits _{x \in T_{(k)}}\log (\hat{P}_x(y=y(x))). \end{aligned}$$

Averaging over the 10 iterations, \(\sum _{k=1}^{10}\log (\hat{P}(T_{(k)}))/10\), gave the measure of the predictive power of each model.
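This computation can be sketched in R as follows; the data frame `birds`, its binary response `oiled`, and the simple formula below are hypothetical placeholders rather than the paper's exact model.

```r
## Minimal sketch of the 10-fold cross-validated predictive
## log-likelihood described above. `birds`, `oiled` and the formula
## are illustrative assumptions; the actual variables follow Sect. 3.4.
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(birds)))

loglik <- numeric(k)
for (i in 1:k) {
  train <- birds[folds != i, ]
  test  <- birds[folds == i, ]
  fit   <- glm(oiled ~ Longitude + Latitude, data = train,
               family = binomial)
  p <- predict(fit, newdata = test, type = "response")  # estimated P_x(y = 1)
  ## log likelihood of the held-out fold: sum over x of log P_x(y = y(x))
  loglik[i] <- sum(log(ifelse(test$oiled == 1, p, 1 - p)))
}
mean(loglik)  # average predictive log likelihood over the 10 folds
```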

We mainly considered four models: (a) the model with constant time-varying coefficients (equivalent to a model that does not use time); (b) the model with \(f\) a linear function of \(t\) for the Longitude splines and constant for the Latitude splines (the one we finally used in the analysis); (c) the model with \(f\) a linear function of \(t\) for the Latitude splines and constant for the Longitude splines; and (d) the model with \(f\) a linear function of \(t\) for both Latitude and Longitude. All four models had significant (or nearly significant) coefficients for all variables and passed the goodness-of-fit test. The average predictive log likelihoods are given in Table 7, which shows that model (b) gave the best predictive power, as claimed in the paper.

Table 7 10-fold cross-validation result
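For concreteness, candidate model (b) could be encoded along the following lines: natural-spline terms in Longitude whose coefficients vary linearly with time, plus time-constant spline terms in Latitude. The variable names and the spline degrees of freedom are illustrative assumptions, not the paper's exact specification.

```r
## Sketch of candidate model (b): the interaction with Time makes each
## Longitude spline coefficient a linear function of t, while the
## Latitude spline coefficients stay constant in time. `birds`, the
## variable names and df = 4 are assumptions for illustration.
library(splines)
fit_b <- glm(oiled ~ ns(Longitude, df = 4) * Time + ns(Latitude, df = 4),
             data = birds, family = binomial)
summary(fit_b)
```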

1.1.2 Model selection of boosted trees

We used 5-fold cross-validation to select the final boosting model. Figure 5 shows the CV curve for this process. The average absolute error on the 1/5 held-out set in each iteration is shown as the red curve, while the black curve shows the training error. We chose the model with the smallest CV error.

Fig. 5 5-fold cross-validation curve of the model fitting
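This selection step could be reproduced with the gbm package roughly as below; the data frame `phys`, the response, the Bernoulli loss and the tuning values are illustrative assumptions rather than the paper's exact settings.

```r
## Hedged sketch of fitting boosted trees with built-in 5-fold CV.
library(gbm)
set.seed(1)
fit_gbm <- gbm(oil ~ Longitude + Latitude + Temperature + Salinity,
               data = phys, distribution = "bernoulli",  # assumed loss
               n.trees = 2000, shrinkage = 0.01,
               interaction.depth = 3, cv.folds = 5)
## pick the iteration minimizing the CV error curve (cf. Fig. 5)
best_iter <- gbm.perf(fit_gbm, method = "cv")
```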

1.2 Regression coefficients used in boosted trees

We used 1901 (location, time) points from the Physical Measurements data in Sect. 3.3. Figure 6 shows the histograms of the slope coefficients of all such regression models, \(\beta _{ij}\) and \(\alpha _{ij}\), for Temperature and Salinity, respectively.

Fig. 6 Histograms of all the slope coefficients \(\beta _{ij}\) and \(\alpha _{ij}\) of the Temperature (left) and Salinity (right) regressions, respectively
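The slope coefficients could be collected as sketched below. The sketch assumes each model is a simple linear regression on Depth within a (location, time) group indexed by a hypothetical `point_id`; the actual specification is the one in Sect. 3.3 and may differ.

```r
## For each (location, time) group, fit Temperature and Salinity
## regressions and keep the slopes beta_ij and alpha_ij.
## `phys`, `point_id` and the Depth covariate are assumptions.
slopes <- t(sapply(split(phys, phys$point_id), function(d) {
  c(beta  = unname(coef(lm(Temperature ~ Depth, data = d))[2]),
    alpha = unname(coef(lm(Salinity ~ Depth, data = d))[2]))
}))
par(mfrow = c(1, 2))
hist(slopes[, "beta"],  main = "Temperature slopes", xlab = expression(beta[ij]))
hist(slopes[, "alpha"], main = "Salinity slopes",    xlab = expression(alpha[ij]))
```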

1.3 Choosing the validation radius in \(B(x,r,t)\)

How to choose the radius \(r\) in \(B(x,r,t)\) is a problem we needed to solve for validation. Because we used a permutation test for the hypothesis, there was no direct way to carry out a power analysis, so we chose \(r\) in an ad hoc way.

There is a trade-off in the choice of \(r\). A larger \(r\) leads to more validation points, but includes points that are not close enough to the bird samples, which gives a poor point estimate via local averaging. A smaller \(r\), on the other hand, tends to give a less biased point estimate, but we might have too few points for testing. Figure 7 shows the number of validation points as \(r\) increases. For small \(r\), the number of validation points is very small and grows slowly; in this range, the neighborhood \(B(x,r,t)\) is too small for most bird samples to include estimation points. As \(r\) becomes larger, the number increases more rapidly, as we begin to include more and more remote points. There are two kinks on the curve. One is around 0.3, which would give only 12 validation points, far from enough for testing. We therefore chose the second kink, ranging from 0.8 to 1.0, and took these values as the candidates for the validation radius.

Fig. 7 Number of validation points for different \(r\)
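The counting behind Fig. 7 can be sketched as follows, assuming hypothetical data frames `birds` (bird samples) and `est` (estimation points) with Longitude, Latitude and Time columns, plain Euclidean distance in coordinate degrees, and an illustrative time window `dt`.

```r
## Count bird samples whose neighborhood B(x, r, t) contains at least
## one estimation point; distance metric and time window are assumptions.
count_valid <- function(r, dt = 1) {
  sum(sapply(seq_len(nrow(birds)), function(i) {
    d_space <- sqrt((est$Longitude - birds$Longitude[i])^2 +
                    (est$Latitude  - birds$Latitude[i])^2)
    d_time  <- abs(est$Time - birds$Time[i])
    any(d_space <= r & d_time <= dt)
  }))
}
rs <- seq(0.1, 1.5, by = 0.1)
plot(rs, sapply(rs, count_valid), type = "b",
     xlab = "r", ylab = "number of validation points")
```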


Cite this article

Li, T., Gao, C., Xu, M. et al. Detecting the impact area of BP deepwater horizon oil discharge: an analysis by time varying coefficient logistic models and boosted trees. Comput Stat 29, 141–157 (2014). https://doi.org/10.1007/s00180-013-0449-y
