Assessing the effective sample size for large spatial datasets: A block likelihood approach

https://doi.org/10.1016/j.csda.2021.107282

Abstract

The development of new techniques for sample size reduction has attracted growing interest in recent decades. Recent findings allow us to quantify the amount of duplicated information within a sample of spatial data through the so-called effective sample size (ESS), whose definition arises from the Fisher information associated with maximum likelihood estimation. However, when the sample size is very large, maximum likelihood estimation and ESS evaluation are computationally challenging. An alternative definition of the ESS, in terms of the Godambe information from a block likelihood estimation approach, is presented. Several theoretical properties satisfied by this quantity are investigated. Our proposal is evaluated under several parametric correlation structures, including the intraclass, AR(1), Matérn, and simultaneous autoregressive models. Simulation experiments show that our proposal provides accurate approximations of the full likelihood-based ESS while maintaining a moderate computational cost. Finally, a large dataset is analyzed to quantify the effectiveness and limitations of the proposed framework in practice.

Introduction

The effective sample size (ESS) has been developed to quantify the number of independent and identically distributed observations within a sample of size n (Cressie, 1993). With the rapid proliferation and acquisition of large datasets, extracting the relevant features and information from such datasets is particularly important, especially when information is reduced by the autocorrelation present in a set of georeferenced observations (Griffith, 2005). These considerations are especially relevant in the analysis of remotely sensed satellite data because, in such a context, the observations generally exhibit high levels of spatial association, so the amount of duplicated information can be considerable (Griffith, 2015).

The literature on the ESS is substantial. While Vallejos and Osorio (2014) provided a novel definition for spatial random fields with constant means, Acosta and Vallejos (2018) extended this quantity to a general spatial regression model, including the asymptotic distribution of the maximum likelihood estimator under increasing-domain sampling schemes. The ESS has also been studied for linear models with replicates (Faes et al., 2009). In a Bayesian context, Berger et al. (2014) addressed the ESS from a model selection perspective. Chatterjee and Diaconis (2018) and Elvira et al. (2018) studied the ESS in specific contexts of importance sampling. ESS applications can be found, for instance, in Solow and Polasky (1994) and Li et al. (2016), while Lenth (2001) provided practical guidelines for ESS determination.

The ESS introduced by Vallejos and Osorio (2014) is based on the Fisher information associated with maximum likelihood (ML) estimation under the Gaussian assumption. The objective functions of ML estimation and the ESS depend on the inverse of the correlation matrix, making their implementation computationally challenging whenever the sample size is very large. Modern approaches require a compromise between statistical and computational performance. For a thorough review of current methods for analyzing large spatial datasets, the reader is referred to Heaton et al. (2019) and the references therein.

To circumvent the so-called “big n” problem, we propose an alternative sample size reduction approach that is fully based on block likelihood inference (Caragea and Smith, 2007; Varin et al., 2011). First, parameter estimation is carried out by means of the block likelihood estimation method; this accomplishes a trade-off between statistical and computational efficiency, since only the inverses of small correlation matrices are involved. Second, a new notion of the effective sample size (named ESSB) comes from the Godambe information arising from this estimation framework. In particular, we focus on the “small blocks” method of Caragea and Smith (2007), which performs better than similar competitors (Caragea and Smith, 2007; Varin et al., 2011; Stein, 2013). At the same time, the “small blocks” version leads to a particularly amenable expression for ESSB. To illustrate the use of ESSB, some parametric correlation structures are revisited, including the intraclass, AR(1), and Matérn models that were previously analyzed by Vallejos and Osorio (2014), as well as the simultaneous autoregressive model (see, e.g., Griffith, 2015). Our proposal is also compared to the full likelihood-based ESS in terms of both statistical and computational efficiency through Monte Carlo simulation experiments. The proposed methodology is applied to a large dataset consisting of 21 million observations from a Harvard Forest database.

The article is organized as follows. Section 2 briefly reviews the definition, some examples, and the main theoretical attributes of the traditional ESS. Section 3 introduces ESSB and its properties. Section 4 discusses computational aspects that permit an efficient implementation of ESSB for regularly-spaced locations. In Section 5, we calculate ESSB for some parametric families of correlation functions. In Section 6, the discrepancies between the ESS and ESSB estimates obtained through simulation experiments are assessed. The computational performances of these approaches are also explored. Section 7 presents a real data application. Finally, Section 8 discusses the main findings and outlines problems for future research. For a neater exposition, the proofs of the main results are given in Appendix A, and additional numerical studies that complement the main findings of the manuscript are contained in Appendix B.

Background

In this section, we describe an approach, proposed by Vallejos and Osorio (2014), that quantifies the amount of duplicated information within a sample of spatial data due to the effect of spatial autocorrelation. The approach is based on the Fisher information about the mean.

Consider a spatial random field $\{X(\mathbf{s}) : \mathbf{s} \in \mathbb{R}^d\}$, and let $X(\mathbf{s}_1), \ldots, X(\mathbf{s}_n)$ be a realization at $n$ spatial locations. For simplicity, we use the notations $X_i = X(\mathbf{s}_i)$ and $\mathbf{X} = (X_1, \ldots, X_n)^\top$, with $\top$ standing for the transpose.
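For the constant-mean Gaussian case, the Fisher-information-based ESS of Vallejos and Osorio (2014) can be written as $\mathrm{ESS} = \mathbf{1}_n^\top R(\theta)^{-1} \mathbf{1}_n$, where $R(\theta)$ is the correlation matrix of the observations. The following minimal sketch (not taken from the paper's code; it assumes NumPy and an AR(1) correlation structure) evaluates this quantity directly:

```python
import numpy as np

def ar1_correlation(n, rho):
    """AR(1) correlation matrix with entries rho^|i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ess(R):
    """Traditional effective sample size: 1' R^{-1} 1."""
    ones = np.ones(R.shape[0])
    return ones @ np.linalg.solve(R, ones)

R = ar1_correlation(100, 0.6)
print(ess(R))                                    # ~25.75
print((100 * (1 - 0.6) + 2 * 0.6) / (1 + 0.6))   # AR(1) closed form, also 25.75
```

Independent observations (ρ = 0) give ESS = n, while the ESS shrinks toward 1 as ρ → 1, which is how the duplicated information is quantified.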

Effective sample size based on Godambe information: definition and properties

An alternative method is proposed to carry out sample size reduction for spatial data. A new notion of the effective sample size is defined for Gaussian random fields in terms of the Godambe information about the mean, which comes from block likelihood inference. We shall use the notation ESSB for our proposal. We expect ESSB to be a reasonable approximation of the traditional ESS.

The block likelihood estimation framework (Caragea and Smith, 2007) is an estimation method within the class of composite likelihood methods (Varin et al., 2011).
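The precise definition of ESSB follows later in this section; as a rough sketch of the underlying idea (an illustration only, not the paper's implementation), consider the estimating equation for the mean obtained by summing the score contributions of the individual blocks under the "small blocks" independence likelihood. Its Godambe (sandwich) information about the mean combines a sensitivity term built from within-block inverses and a variability term that accounts for the correlation between blocks, and a ratio of the form $H^2/J$ then plays the role of an effective sample size:

```python
import numpy as np

def essb_small_blocks(R, blocks):
    """Illustrative Godambe-information-based effective sample size for a
    'small blocks' independence likelihood (a sketch only; the paper's exact
    definition of ESSB is given in Section 3).

    R      : (n, n) correlation matrix of the full sample
    blocks : list of index arrays partitioning 0, ..., n-1
    """
    # Sensitivity term H: sum over blocks of 1' R_b^{-1} 1.
    block_solves = []                       # cache R_{b_i}^{-1} 1 per block
    H = 0.0
    for b in blocks:
        v = np.linalg.solve(R[np.ix_(b, b)], np.ones(len(b)))
        block_solves.append(v)
        H += v.sum()

    # Variability term J: sum over block pairs of 1' R_{b_i}^{-1} R_{b_i b_j} R_{b_j}^{-1} 1.
    J = 0.0
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            J += block_solves[i] @ R[np.ix_(bi, bj)] @ block_solves[j]

    return H ** 2 / J                       # sandwich information about the mean
```

With a single block containing all n observations, H and J both reduce to $\mathbf{1}^\top R^{-1}\mathbf{1}$, so the ratio collapses to the traditional ESS, consistent with the reduction property mentioned in the Discussion.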

Computational aspects for regular grids

Regular grids are common in satellite data and image modeling. In the computation of ESSB, a regularly gridded spatial design permits the reuse of some calculations. We now discuss some computational tools that will be used in the subsequent sections.

For a spatial random field defined on a rectangular grid of $\mathbb{Z}^2$, assume that all blocks are regular lattices of the same size, so that $|b_i| = |b|$ for every $i = 1, \ldots, m$. Let $H_{b_i b_j}$ be the distance matrix between blocks $i$ and $j$; that is, its $(k, r)$ entry is the distance between the $k$th location in block $i$ and the $r$th location in block $j$.
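A minimal sketch of the reuse idea (the block shape, the Euclidean metric, and the caching scheme below are illustrative assumptions): on a regular grid with equally sized blocks, the distance matrix between two blocks depends only on the offset between their corners, so it can be computed once per offset and looked up afterwards.

```python
import numpy as np

def block_sites(block_shape):
    """Grid coordinates of the sites inside one rectangular block."""
    r, c = np.indices(block_shape)
    return np.column_stack([r.ravel(), c.ravel()]).astype(float)

def cross_distances(block_shape, offset):
    """Distance matrix between a block and a copy of it shifted by `offset`."""
    s = block_sites(block_shape)
    t = s + np.asarray(offset, dtype=float)
    diff = s[:, None, :] - t[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

_cache = {}
def cached_cross_distances(block_shape, offset):
    """Equally shaped blocks repeat on a regular grid, so each offset is computed once."""
    key = (block_shape, tuple(offset))
    if key not in _cache:
        _cache[key] = cross_distances(block_shape, offset)
    return _cache[key]

H01 = cached_cross_distances((4, 4), (0, 4))   # e.g., horizontally adjacent 4 x 4 blocks
```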

Examples

In this section, the ESS and ESSB are compared for different correlation structures. We focus on the intraclass and AR(1) correlation matrices, which are described in the previous sections, as well as on the Matérn and simultaneous autoregressive correlation models, which are widely used in the spatial statistics literature.
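For reference, two of these structures admit simple closed forms for the traditional quantity $\mathbf{1}_n^\top R^{-1}\mathbf{1}_n$ (standard algebraic consequences of the definition, stated here only as a sanity check for the comparisons):

```latex
\mathrm{ESS}_{\mathrm{intraclass}}(n,\rho) = \frac{n}{1+(n-1)\rho},
\qquad
\mathrm{ESS}_{\mathrm{AR}(1)}(n,\rho) = \frac{n(1-\rho)+2\rho}{1+\rho},
\qquad 0 \le \rho < 1 .
```

Both expressions equal $n$ at $\rho = 0$ and decrease to 1 as $\rho \to 1$.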

Maximum likelihood versus block likelihood estimation

The goal of this section is to compare the relative performances of the block likelihood (BL) and maximum likelihood (ML) estimation methods. We focus on the estimation of the range parameter of the covariance function because it is the aspect that most influences the assessment of the effective sample size. All computations described below were performed using a computer equipped with a 2.7 GHz processor and 8 GB of RAM.

To run the experiments, lattices in $\mathbb{R}^2$ of increasing sizes were considered, starting from $16 \times 8 = 128$ locations.
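A schematic version of this comparison (not the paper's code; the exponential correlation model, the block partition, and the known zero mean are assumptions made for brevity) fits the range parameter by numerically maximizing either the full Gaussian log-likelihood or the "small blocks" independence log-likelihood on a 16 × 8 lattice:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)

# A 16 x 8 lattice with exponential correlation exp(-d / phi), true phi = 2 (illustrative).
coords = np.array([(i, j) for i in range(16) for j in range(8)], dtype=float)
D = cdist(coords, coords)
R_true = np.exp(-D / 2.0)
x = np.linalg.cholesky(R_true) @ rng.standard_normal(len(coords))

def full_negloglik(phi):
    R = np.exp(-D / phi)
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * (logdet + x @ np.linalg.solve(R, x))

# Partition the lattice into eight 4 x 4 blocks and sum block-wise log-likelihoods.
blocks = [np.where((coords[:, 0] // 4 == a) & (coords[:, 1] // 4 == b))[0]
          for a in range(4) for b in range(2)]

def block_negloglik(phi):
    total = 0.0
    for b in blocks:
        Rb = np.exp(-D[np.ix_(b, b)] / phi)
        _, logdet = np.linalg.slogdet(Rb)
        total += 0.5 * (logdet + x[b] @ np.linalg.solve(Rb, x[b]))
    return total

phi_ml = minimize_scalar(full_negloglik, bounds=(0.1, 10), method="bounded").x
phi_bl = minimize_scalar(block_negloglik, bounds=(0.1, 10), method="bounded").x
print(phi_ml, phi_bl)   # both estimates should fall near the true range phi = 2
```

The block version only ever factorizes 16 × 16 matrices, which is the source of the computational savings on larger lattices.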

Data analysis

We illustrate the use of the block likelihood approach developed in Section 3 on a forest dataset, which consists of a three-band reference image of size 5616 × 3744 pixels and thus represents a dataset with a large sample size ($n = 21{,}026{,}304$). The image is shown in Fig. 7.1; it was taken above a section of forest at the Harvard Forest, Petersham, MA, USA, and belongs to one of the comprehensive databases that are part of a long-term study carried out at Harvard Forest. The image and the code used for this analysis are available online.
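As a practical note (the file name, block size, and the use of a single band below are assumptions for illustration), an image of this size is never handled through its full $n \times n$ correlation matrix; instead, the pixel grid is partitioned into small blocks, each of which contributes only a small correlation matrix to the computation of ESSB, so the image can be processed tile by tile:

```python
import numpy as np

# Illustrative only: one band of the image as a (3744, 5616) array.
band = np.load("harvard_forest_band1.npy")   # hypothetical file name
bh, bw = 8, 8                                # block size (assumption)

def iter_blocks(img, bh, bw):
    """Yield (row_offset, col_offset, tile) for non-overlapping bh x bw tiles."""
    H, W = img.shape
    for r in range(0, H - H % bh, bh):
        for c in range(0, W - W % bw, bw):
            yield r, c, img[r:r + bh, c:c + bw]

n_blocks = sum(1 for _ in iter_blocks(band, bh, bw))
print(n_blocks)   # (3744 // 8) * (5616 // 8) = 328,536 blocks of 64 pixels each
```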

Discussion

This paper introduced a new way to address the computation of the effective sample size. The methodology is based on block likelihood inference. We showed that ESSB preserves some relevant attributes of the traditional ESS. The approach, equipped with powerful computational machinery, is appropriate for large spatial datasets and reduces to the original ESS when the number of blocks is equal to one. The use of our findings has been illustrated with a real forest dataset.

Acknowledgements

Alfredo Alegría acknowledges the funding of the National Agency for Research and Development of Chile, through grant ANID/FONDECYT/INICIACIÓN/No. 11190686. Felipe Osorio was partially supported by UTFSM through grant PI_LI_19_11. Ronny Vallejos was partially supported by the AC3E, UTFSM, under grant FB-0008, and by UTFSM under grant PI_L_18_20. Ronny Vallejos and Felipe Osorio also acknowledge financial support from CONICYT through the MATH-AMSUD program, grant 20-MATH-03.

References

  • Aho, A.V., et al. (1974). The Design and Analysis of Computer Algorithms.

  • Berger, J., et al. (2014). The effective sample size. Econom. Rev.

  • Bevilacqua, M., et al. (2015). Comparing composite likelihood methods based on pairs for spatial Gaussian random fields. Stat. Comput.

  • Chatterjee, S., Diaconis, P. (2018). The sample size required in importance sampling. Ann. Appl. Probab.

  • Cressie, N.A.C. (1993). Statistics for Spatial Data.

  • Elvira, V., et al. (2018). Rethinking the effective sample size.
