Abstract
Many problems in the environmental and biological sciences involve the analysis of large quantities of data. The data in these problems are often structured and, in particular, spatially dependent. Traditional model fitting often fails for datasets of this size because the full covariance matrix describing the spatial dependence is difficult both to specify and to compute with. We propose a very general type of mixed model that has a random spatial component. Recognizing that spatial covariance matrices often contain many zero or near-zero entries, we use covariance tapering to force near-zero entries to zero. Then, taking advantage of the sparsity of the tapered covariance matrices, we use backfitting to estimate the fixed and random model parameters. The novelty of the paper is the combination of the two techniques, tapering and backfitting, to model and analyze spatial datasets several orders of magnitude larger than those typically analyzed with conventional approaches. Results are demonstrated with two datasets. The first consists of regional climate model output from an experiment with two regional and two driver models arranged in a two-by-two layout. The second is microarray data used to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures.
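The combination of tapering and backfitting described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The following are assumptions made only for illustration: one-dimensional locations, an exponential covariance, a Wendland-type taper, a model with an intercept and one covariate, a known nugget variance, and a conjugate-gradient solver standing in for whatever sparse solver the paper uses. All function names and parameter values are hypothetical.

```python
import math

def wendland_taper(d, theta):
    """Wendland-type taper (1 - d/theta)^4 (1 + 4 d/theta); exactly 0 for d >= theta."""
    if d >= theta:
        return 0.0
    r = d / theta
    return (1.0 - r) ** 4 * (1.0 + 4.0 * r)

def tapered_cov_rows(locs, theta, sill=1.0, cov_range=0.2):
    """Tapered exponential covariance, stored as one {column: value} dict per row."""
    rows = []
    for si in locs:
        row = {}
        for j, sj in enumerate(locs):
            d = abs(si - sj)
            taper = wendland_taper(d, theta)
            if taper > 0.0:  # entries beyond the taper range are never stored
                row[j] = sill * math.exp(-d / cov_range) * taper
        rows.append(row)
    return rows

def sparse_matvec(rows, v):
    """Multiply the sparse matrix by a dense vector, touching stored entries only."""
    return [sum(a * v[j] for j, a in row.items()) for row in rows]

def cg_solve(rows, nugget, b, tol=1e-10, max_iter=1000):
    """Solve (C + nugget * I) x = b by conjugate gradients (assumed solver choice)."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs < tol * tol:  # stop once the residual norm falls below tol
            break
        Ap = [cp + nugget * pi for cp, pi in zip(sparse_matvec(rows, p), p)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def backfit(y, x, rows, nugget, max_sweeps=100, tol=1e-8):
    """Alternate a least-squares step for (b0, b1) with a spatial smoothing step for h."""
    n = len(y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0 = b1 = 0.0
    h = [0.0] * n
    for _ in range(max_sweeps):
        # Fixed-effect step: regress (y - h) on an intercept and x.
        z = [yi - hi for yi, hi in zip(y, h)]
        zbar = sum(z) / n
        new_b1 = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
        new_b0 = zbar - new_b1 * xbar
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        # Random-effect step: h = C (C + nugget I)^{-1} (y - b0 - b1 x).
        resid = [yi - b0 - b1 * xi for yi, xi in zip(y, x)]
        h = sparse_matvec(rows, cg_solve(rows, nugget, resid))
        if converged:
            break
    return b0, b1, h

# Synthetic illustration: 100 hypothetical sites on [0, 1] with made-up data.
locs = [i / 99 for i in range(100)]
rows = tapered_cov_rows(locs, theta=0.05)
y = [1.0 + 2.0 * s + 0.3 * math.sin(25.0 * s) for s in locs]
b0, b1, h = backfit(y, locs, rows, nugget=0.1)
nnz = sum(len(row) for row in rows)
print(f"stored entries: {nnz} of {len(locs) ** 2}")
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Because the taper is exactly zero beyond `theta`, each site interacts only with its near neighbours, so the cost of every smoothing step grows with the number of stored entries rather than with the square of the number of sites. The sweep limit and tolerance are arbitrary choices in this sketch; backfitting can converge slowly when the spatial term can mimic the fixed effects, so a real analysis would monitor convergence more carefully.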
Cite this article
Furrer, R., Sain, S.R. Spatial model fitting for large datasets with applications to climate and microarray problems. Stat Comput 19, 113–128 (2009). https://doi.org/10.1007/s11222-008-9075-x