Assessing the effective sample size for large spatial datasets: A block likelihood approach
Introduction
The effective sample size (ESS) has been developed to quantify the number of independent and identically distributed observations that a sample of size n is worth (Cressie, 1993). With the rapid proliferation and acquisition of large datasets, extracting the relevant features and information they contain is particularly important, especially when the autocorrelation present in a set of georeferenced observations reduces the information the sample actually carries (Griffith, 2005). These considerations are particularly relevant to the analysis of remotely sensed satellite data because, in such a context, the observations generally exhibit high levels of spatial association, so the amount of duplicated information can be considerable (Griffith, 2015).
The literature on the ESS is substantial. Vallejos and Osorio (2014) provided a novel definition for spatial random fields with constant means, and Acosta and Vallejos (2018) extended this quantity to a general spatial regression model, including the asymptotic distribution of the maximum likelihood estimator under increasing-domain sampling schemes. The ESS has also been studied for linear models with replicates (Faes et al., 2009). In a Bayesian context, Berger et al. (2014) addressed the ESS from a model selection perspective. Chatterjee and Diaconis (2018) and Elvira et al. (2018) studied the ESS in specific contexts of importance sampling. ESS applications can be found, for instance, in Solow and Polansky (1994) and Li et al. (2016), while Lenth (2001) provided practical guidelines for ESS determination.
The ESS introduced by Vallejos and Osorio (2014) is based on the Fisher information associated with maximum likelihood (ML) estimation under the Gaussian assumption. Both the ML objective function and the ESS depend on the inverse of the correlation matrix, which makes their implementation computationally challenging whenever the sample size is very large. Modern approaches require a compromise between statistical and computational performance. For a thorough review of current methods for analyzing large spatial datasets, the reader is referred to Heaton et al. (2019) and the references therein.
To circumvent the so-called “big n” problem, we propose an alternative sample size reduction approach that is fully based on block likelihood inference (Caragea and Smith, 2007; Varin et al., 2011). First, parameter estimation is carried out by means of the block likelihood estimation method; this achieves a trade-off between statistical and computational efficiency, since only the inverses of small correlation matrices are involved. Second, a new notion of the effective sample size, which we call the block likelihood ESS, emerges from the Godambe information associated with this estimation framework. In particular, we focus on the “small blocks” method of Caragea and Smith (2007), which performs better than similar competitors (Caragea and Smith, 2007; Varin et al., 2011; Stein, 2013) and generates a particularly amenable expression for the proposed quantity. To illustrate its use, some parametric correlation structures are revisited, including the intraclass, AR(1), and Matérn models previously analyzed by Vallejos and Osorio (2014), as well as the simultaneous autoregressive model (see, e.g., Griffith, 2015). Our proposal is also compared to the full likelihood-based ESS in terms of both statistical and computational efficiency through Monte Carlo simulation experiments. The proposed methodology is applied to a large dataset consisting of 21 million observations from a Harvard Forest database.
The article is organized as follows. Section 2 briefly reviews the definition, some examples, and the main theoretical attributes of the traditional ESS. Section 3 introduces the block likelihood ESS and its properties. Section 4 discusses computational aspects that permit an efficient implementation of the proposed quantity for regularly spaced locations. In Section 5, we calculate the block likelihood ESS for some parametric families of correlation functions. In Section 6, the discrepancies between the ESS and its block likelihood counterpart, estimated through simulation experiments, are assessed; the computational performance of both approaches is also explored. Section 7 presents a real data application. Finally, Section 8 discusses the main findings and outlines problems for future research. For a neater exposition, the proofs of the main results are given in Appendix A, and additional numerical studies that complement the main findings of the manuscript are contained in Appendix B.
Section snippets
Background
In this section, we describe an approach for quantifying the amount of duplicated information within a sample of spatial data due to the effect of spatial autocorrelation. This approach, proposed by Vallejos and Osorio (2014), is based on the Fisher information about the mean.
Consider a spatial random field Z(s), s ∈ D, and let Z(s_1), …, Z(s_n) be a realization at n spatial locations. For simplicity, we use the notations Z = (Z(s_1), …, Z(s_n))⊤ and 1_n = (1, …, 1)⊤, with ⊤ standing for the transpose
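The quadratic form behind this definition, ESS = 1_n⊤ R⁻¹ 1_n for an n × n correlation matrix R, can be evaluated without ever forming R⁻¹. A minimal NumPy sketch (the exponential correlation model and its range below are illustrative assumptions, not part of the paper):

```python
import numpy as np

def ess(R):
    """Traditional effective sample size 1' R^{-1} 1, computed through a
    Cholesky solve so the correlation matrix R is never inverted explicitly."""
    n = R.shape[0]
    L = np.linalg.cholesky(R)
    y = np.linalg.solve(L, np.ones(n))  # triangular system L y = 1_n
    return float(y @ y)                 # 1' R^{-1} 1 = ||L^{-1} 1_n||^2

# Illustration on a hypothetical line transect with exponential correlation
# exp(-d / 0.5); the range 0.5 is an arbitrary choice for the example.
coords = np.linspace(0.0, 1.0, 50)
dist = np.abs(coords[:, None] - coords[None, :])
R = np.exp(-dist / 0.5)
print(ess(R))  # far smaller than n = 50 due to strong autocorrelation
```

For independent data (R equal to the identity) the quadratic form returns exactly n, which is the sanity check one expects from any ESS.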
Effective sample size based on Godambe information: definition and properties
An alternative method is proposed to carry out sample size reduction for spatial data. A new notion of the effective sample size is defined for Gaussian random fields in terms of the Godambe information about the mean, which arises from block likelihood inference. We shall refer to this quantity as the block likelihood ESS and expect it to be a reasonable approximation of the traditional ESS.
The block likelihood estimation framework (Caragea and Smith, 2007) is an estimation method within the class of
Computational aspects for regular grids
Regular grids are common in satellite data and image modeling. In the computation of the block likelihood ESS, a regularly gridded spatial design permits the reuse of some calculations. We now discuss some computational tools that will be used in the subsequent sections.
For a spatial random field defined on a rectangular grid, assume that all blocks are regular lattices of the same size, so that every block contains the same number of sites. Let D_{ij} denote the distance matrix between blocks i and j, that is, the
Examples
In this section, the ESS and the block likelihood ESS are compared for different correlation structures. We focus on the intraclass and AR(1) correlation matrices described in the previous sections, as well as on the Matérn and simultaneous autoregressive correlation models, which are widely used in the spatial statistics literature.
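Both of the first two structures admit closed forms for the quadratic form 1⊤ R⁻¹ 1: n/(1 + (n−1)ρ) for the intraclass model and (n − (n−2)ρ)/(1 + ρ) for the AR(1) model. A quick numerical check (n and ρ below are arbitrary example values):

```python
import numpy as np

def ess(R):
    """Effective sample size 1' R^{-1} 1 for a correlation matrix R."""
    n = R.shape[0]
    return float(np.ones(n) @ np.linalg.solve(R, np.ones(n)))

n, rho = 20, 0.4  # arbitrary example values

# Intraclass: unit diagonal, constant off-diagonal correlation rho.
R_ic = np.full((n, n), rho)
np.fill_diagonal(R_ic, 1.0)
print(ess(R_ic), n / (1 + (n - 1) * rho))          # the two values agree

# AR(1): correlation rho^{|i-j|} between sites i and j.
idx = np.arange(n)
R_ar = rho ** np.abs(idx[:, None] - idx[None, :])
print(ess(R_ar), (n - (n - 2) * rho) / (1 + rho))  # the two values agree
```

For the Matérn and simultaneous autoregressive models no such simple closed forms are available, which is where the numerical machinery of the previous sections becomes relevant.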
Maximum likelihood versus block likelihood estimation
The goal of this section is to compare the relative performance of the block likelihood (BL) and maximum likelihood (ML) estimation methods. We focus on the estimation of the range parameter of the covariance function because it is the aspect that most influences the assessment of the effective sample size. All computations described below were performed on a computer equipped with a 2.7 GHz processor and 8 GB of RAM.
To run the experiments, lattices of sizes
Data analysis
We illustrate the use of the block likelihood approach developed in Section 3 on a forest dataset, which consists of a three-band reference image of size 5616 × 3744 pixels and thus represents a dataset with a large sample size (approximately 21 million pixels per band). The image is shown in Fig. 7.1; it was taken above a section of the Harvard Forest, Petersham, MA, USA, and belongs to one of the comprehensive databases that are part of a long-term study carried out at Harvard Forest. The image and code for
Discussion
This paper introduced a new way to address the computation of the effective sample size, based on block likelihood inference. We showed that the block likelihood ESS preserves some relevant attributes of the traditional ESS. The approach, equipped with powerful computational machinery, is appropriate for large spatial datasets and reduces to the original ESS when the number of blocks is equal to one. The use of our findings has been illustrated with a real forest dataset. Once the block likelihood ESS was
Acknowledgements
Alfredo Alegría acknowledges the funding of the National Agency for Research and Development of Chile, through grant ANID/FONDECYT/INICIACIÓN/No. 11190686. Felipe Osorio was partially supported by UTFSM through grant PI_LI_19_11. Ronny Vallejos was partially supported by the AC3E, UTFSM, under grant FB-0008, and by UTFSM under grant PI_L_18_20. Ronny Vallejos and Felipe Osorio also acknowledge financial support from CONICYT through the MATH-AMSUD program, grant 20-MATH-03.
References (33)
- Composite likelihood estimation for a Gaussian process under fixed domain asymptotics. J. Multivar. Anal. (2019)
- Caragea, P.C., Smith, R.L. Asymptotic properties of computationally efficient alternative estimators for a class of multivariate normal models. J. Multivar. Anal. (2007)
- Likelihood approximation with hierarchical matrices for large spatial datasets. Comput. Stat. Data Anal. (2019)
- Flexible and efficient estimating equations for variogram estimation. Comput. Stat. Data Anal. (2018)
- Vallejos, R., Osorio, F. Effective sample size of spatial process models. Spat. Stat. (2014)
- On the relationship between conditional (CAR) and simultaneous (SAR) autoregressive models. Spat. Stat. (2018)
- Efficient maximum approximated likelihood inference for Tukey's g-and-h distribution. Comput. Stat. Data Anal. (2015)
- Abramowitz, M., Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables (1972)
- Acosta, J., Vallejos, R. Effective sample size for spatial regression processes. Electron. J. Stat. (2018)
- On the effective geographic sample size. J. Stat. Comput. Simul. (2018)
- Aho, A.V., Hopcroft, J.E., Ullman, J.D. The Design and Analysis of Computer Algorithms
- The effective sample size. Econom. Rev.
- Comparing composite likelihood methods based on pairs for spatial Gaussian random fields. Stat. Comput.
- Chatterjee, S., Diaconis, P. The sample size required in importance sampling. Ann. Appl. Probab. (2018)
- Cressie, N. Statistics for Spatial Data (1993)
- Elvira, V., et al. Rethinking the effective sample size