Wald-based spatial scan statistics for cluster detection

https://doi.org/10.1016/j.csda.2018.06.002

Abstract

The spatial scan test, which is often carried out by maximizing a likelihood ratio-based statistic over a collection of cluster candidates, is widely used in cluster detection and disease surveillance. As the likelihood ratio statistic may not be available if the exact distribution of the response variable is not specified, a Wald-based spatial scan approach is proposed. The idea is to construct a special explanatory variable for spatial clusters in the linear function of a statistical model. The spatial scan test is carried out by scanning the special explanatory variable over the collection of cluster candidates. An advantage is that the Wald-based spatial scan statistic can bridge spatial clusters and linear functions of statistical models. It can be easily combined with well-known statistical models beyond generalized linear models. It is expected that the proposed approach will have a great impact on cluster detection when the likelihood inference is intractable or unavailable.

Introduction

The spatial scan statistic is typically formulated as a hypothesis testing problem, with the null hypothesis that a disease rate is homogeneous over the entire region against the alternative hypothesis that the disease rate is elevated in a subregion. It has been successfully formulated under the framework of logistic linear models for Bernoulli or binomial data (Kulldorff and Nagarwalla, 1995) and loglinear models for Poisson data (Assuncao and Costa, 2006; Zhang and Lin, 2009). The spatial scan approach, which is carried out by a spatial scan statistic (Kulldorff, 1997; Tango and Takahashi, 2005), is popular and widely used in cluster detection and disease surveillance. It is regarded as an important and fundamental tool in spatial epidemiology. The spatial scan approach has been extended to detect clusters in multinomial data (Jung et al., 2010), normal data (Huang et al., 2009), and survival data (Bhatt and Tiwari, 2014; Huang et al., 2007). It has also been extended to account for spatial correlation (Loh and Zhu, 2007), overdispersion (Zhang et al., 2012), and inflated zeros (Cançado et al., 2014; de Lima et al., 2015, 2017). Previous spatial scan statistics are mostly formulated under the framework of likelihood ratio statistics. Because computing a likelihood ratio statistic requires the exact distribution, the previous spatial scan approach is difficult to implement if the exact distribution is not provided or is hard to compute.
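For reference, the likelihood ratio-based statistic for Poisson counts can be sketched as follows. This is a minimal illustration of the commonly used form of Kulldorff's (1997) Poisson log-likelihood ratio, conditional on the total count. The function names, the toy counts and populations, and the list of candidate zones are assumptions made for illustration and do not come from the article.

```python
import math

def poisson_llr(c_in, e_in, c_tot):
    """Log-likelihood ratio for one candidate zone (elevated-rate alternative).

    c_in  : observed count inside the zone
    e_in  : expected count inside the zone under the null
            (expected counts over the whole region sum to c_tot)
    c_tot : total observed count in the study region
    """
    c_out, e_out = c_tot - c_in, c_tot - e_in
    # Only zones with a higher relative rate inside than outside qualify.
    if e_in <= 0 or e_out <= 0 or c_in / e_in <= c_out / e_out:
        return 0.0
    llr = c_in * math.log(c_in / e_in)
    if c_out > 0:  # treat 0 * log(0) as 0
        llr += c_out * math.log(c_out / e_out)
    return llr

def scan_llr(counts, pops, zones):
    """Maximize the log-likelihood ratio over a collection of candidate zones."""
    c_tot, n_tot = sum(counts), sum(pops)
    best_zone, best = None, 0.0
    for zone in zones:
        c_in = sum(counts[i] for i in zone)
        # Null expectation: cases are spread in proportion to population.
        e_in = c_tot * sum(pops[i] for i in zone) / n_tot
        llr = poisson_llr(c_in, e_in, c_tot)
        if llr > best:
            best_zone, best = zone, llr
    return best_zone, best

# Hypothetical data: six units, zones given as tuples of unit indices.
counts = [4, 5, 20, 18, 6, 3]
pops = [1000, 1200, 1100, 900, 1300, 1000]
zones = [(0,), (2,), (2, 3), (0, 1), (3, 4), (4, 5)]
best_zone, best_llr = scan_llr(counts, pops, zones)
print(best_zone, round(best_llr, 3))
```

Every term in `poisson_llr` comes from the Poisson likelihood, which is why this construction is unavailable when the exact distribution is not specified.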

The likelihood ratio-based spatial scan statistic has nice theoretical properties. By the Neyman–Pearson Lemma (Lehmann, 1986, p. 72), the uniformly most powerful (UMP) test can be formulated by the likelihood ratio statistic, indicating that likelihood ratio-based spatial scan statistics are powerful in detecting spatial clusters. The Neyman–Pearson Lemma does not claim, however, that the likelihood ratio statistic dominates every other test statistic. Other tests may be as powerful as the likelihood ratio test. An example is the well-known t-test in linear models, which provides a uniformly most powerful unbiased (UMPU) test for the significance of regression coefficients (Lehmann, 1986, p. 397). As the t-statistic becomes the Wald statistic in linear regression, we study the Wald-based spatial scan approach in this article.
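To make the link between the t-statistic and the Wald statistic concrete, the following is a sketch in assumed notation. The symbols $X$, $\beta$, $\sigma^2$, $n$, and $p$ are introduced here for illustration and are not taken from the article. Consider the linear model $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$ and a full-rank $n \times p$ design matrix $X$:

```latex
\hat\beta = (X^\top X)^{-1} X^\top y, \qquad
\hat\sigma^2 = \frac{\lVert y - X\hat\beta \rVert^2}{n - p}, \qquad
\widehat{\operatorname{Var}}(\hat\beta_j) = \hat\sigma^2 \, [(X^\top X)^{-1}]_{jj},

t_j = \frac{\hat\beta_j}{\sqrt{\widehat{\operatorname{Var}}(\hat\beta_j)}}, \qquad
W_j = \frac{\hat\beta_j^2}{\widehat{\operatorname{Var}}(\hat\beta_j)} = t_j^2 .
```

If the maximum likelihood estimate of $\sigma^2$ (divisor $n$ rather than $n - p$) is used in the variance, the Wald statistic equals $t_j^2 \, n/(n - p)$, a monotone function of $t_j^2$, so the two statistics order the data in the same way.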

The formulation of Wald-based spatial scan statistics is consistent with the output formats of general statistical models. If the distribution of a response variable is modeled by a linear function of explanatory variables, then any valid fitting procedure should provide estimates of the linear coefficients and their variance–covariance matrix. A set of Wald statistics (Wald, 1943) is typically used to assess the significance of individual linear coefficients. A Wald-based spatial scan statistic can be formulated if we can transform spatial clusters into explanatory variables. Since the derivation can be based on any estimation method, the Wald-based spatial scan statistic provides an important option if the computation of the likelihood ratio statistic is difficult or even impossible. Although initial ideas for Poisson data can be traced back to Zhang and Lin (2009, 2013), a formal approach for statistical models beyond GLMs (generalized linear models) has not been investigated, which motivates the present research.
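The idea of turning a candidate cluster into an explanatory variable can be sketched for the simplest case. The example below assumes a Poisson loglinear model with a population offset, an intercept, and a single 0/1 indicator for the candidate zone. In that special case the coefficient estimate and its standard error have closed forms, so no model-fitting library is needed. The function names and the toy data are hypothetical, and the sketch is not claimed to be the article's exact construction.

```python
import math

def wald_z(counts, pops, zone):
    """Wald z-statistic for the zone indicator in
    log E(y_i) = log(pop_i) + b0 + theta * 1(i in zone)."""
    inside = set(zone)
    y_in = sum(counts[i] for i in inside)
    n_in = sum(pops[i] for i in inside)
    y_out = sum(counts) - y_in
    n_out = sum(pops) - n_in
    if y_in == 0 or y_out == 0:
        return 0.0  # log rate ratio is not finite; treat the zone as uninformative
    # Estimated log rate ratio between inside and outside the zone.
    theta = math.log((y_in / n_in) / (y_out / n_out))
    # Its variance from the inverse Fisher information: 1/y_in + 1/y_out.
    se = math.sqrt(1.0 / y_in + 1.0 / y_out)
    return theta / se

def wald_scan(counts, pops, zones):
    """Scan the indicator over all candidates and keep the largest z."""
    scores = {zone: wald_z(counts, pops, zone) for zone in zones}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical data: six units, zones given as tuples of unit indices.
counts = [4, 5, 20, 18, 6, 3]
pops = [1000, 1200, 1100, 900, 1300, 1000]
zones = [(0,), (2,), (2, 3), (0, 1), (3, 4), (4, 5)]
best_zone, best_z = wald_scan(counts, pops, zones)
print(best_zone, round(best_z, 3))
```

Only an estimate and its standard error are needed per candidate, so the same loop applies to any fitting procedure that reports coefficients and a variance–covariance matrix. Because the statistic is a maximum over many candidates, it should not be compared with standard normal quantiles directly.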

The proposed approach is important for extending and generalizing the spatial scan test for cluster detection. Note that Kulldorff’s spatial scan statistic (Kulldorff, 1997) is constructed via a likelihood ratio statistic in a Bernoulli or a Poisson model. It cannot be used if the exact distribution of the data is not provided or is intractable. Tango and Takahashi’s flexibly shaped spatial scan statistic (Tango and Takahashi, 2005) is also constructed via a likelihood ratio statistic. It faces the same problem if the likelihood function is not provided or is intractable. An obvious and important example is the construction of the spatial scan statistic in the quasi-Poisson model (McCullagh, 1983). When the variability of the disease count exceeds the value implied by the Poisson model, disregarding the overdispersion may inflate type I error probabilities (Zhang et al., 2012). This phenomenon is often termed the overdispersion problem in GLMs. Since the exact distribution is usually not specified in the quasi-Poisson model, the likelihood ratio statistic is generally not well-defined. To solve the problem, one can introduce a Gamma distribution for the overdispersion. This may induce a likelihood ratio-based quasi-Poisson spatial scan statistic, but the derivation of a likelihood ratio-based spatial scan statistic is hard if a normal prior is used instead. With a Wald-based approach, we can address the difference between the choices of the Gamma and the normal distributions for overdispersion. In addition, the proposed Wald-based spatial scan approach bridges cluster detection and linear functions of statistical models. It can be easily combined with well-known statistical approaches when response and explanatory variables are involved.
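The effect of overdispersion on a Wald statistic can be illustrated with a small sketch. It assumes the standard quasi-likelihood recipe: keep the Poisson point estimate, estimate a dispersion parameter from the Pearson chi-square divided by the residual degrees of freedom, and inflate the variance by that factor. The model has a population offset, an intercept, and one zone indicator, so everything has a closed form. The function name and the data are hypothetical, and this moment-based dispersion estimate is one common choice rather than a formula confirmed by the article.

```python
import math

def quasi_poisson_wald(counts, pops, zone):
    """Return (Poisson Wald z, quasi-Poisson Wald z, dispersion estimate)."""
    inside = set(zone)
    y_in = sum(c for i, c in enumerate(counts) if i in inside)
    n_in = sum(n for i, n in enumerate(pops) if i in inside)
    y_out = sum(counts) - y_in
    n_out = sum(pops) - n_in
    if y_in == 0 or y_out == 0:
        raise ValueError("both sides of the zone need a positive count")
    rate_in, rate_out = y_in / n_in, y_out / n_out
    theta = math.log(rate_in / rate_out)            # log rate ratio
    se_poisson = math.sqrt(1.0 / y_in + 1.0 / y_out)
    # Pearson chi-square against the fitted means of the two-parameter model.
    pearson = 0.0
    for i, (c, n) in enumerate(zip(counts, pops)):
        mu = n * (rate_in if i in inside else rate_out)
        pearson += (c - mu) ** 2 / mu
    phi = pearson / (len(counts) - 2)               # two fitted parameters
    z_poisson = theta / se_poisson
    z_quasi = z_poisson / math.sqrt(phi)            # variance inflated by phi
    return z_poisson, z_quasi, phi

# Hypothetical overdispersed counts on eight equally sized units.
counts = [2, 14, 30, 12, 1, 9, 3, 11]
pops = [1000] * 8
z_poisson, z_quasi, phi = quasi_poisson_wald(counts, pops, zone=(2, 3))
print(round(z_poisson, 3), round(z_quasi, 3), round(phi, 3))
```

With a dispersion estimate well above one, the Poisson-based statistic overstates the evidence for the zone, which is consistent with the type I error inflation described above. No likelihood is evaluated at any point in the calculation.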

The article is organized as follows. In Section 2, we briefly review the likelihood ratio-based spatial scan statistics. In Section 3, we propose the Wald-based spatial scan approach and give its specification for a few important statistical models, such as the negative binomial, the quasi-binomial, and the quasi-Poisson models. Note that they are not exponential family distributions. These examples indicate that the spatial scan test can still be used if the model is not a GLM. In Section 4, we numerically evaluate the properties of our Wald-based spatial scan statistic in comparison with the likelihood ratio-based spatial scan statistic. In Section 5, we provide a discussion.

Section snippets

Likelihood ratio-based spatial scan statistic

The scan approach was originally developed for one-dimensional point processes (Naus, 1965). By a likelihood ratio-based method, Kulldorff (1997) extended it to cluster detection for two-dimensional aggregated unit data when the response follows a Poisson or Bernoulli distribution. Kulldorff’s scan approach was later extended to other distributions. In order to understand the impact of our Wald-based spatial scan statistic, it is important to review the likelihood ratio-based spatial scan

Wald-based spatial scan statistic

We propose our approach under the framework of general statistical models. Because of the importance of GLMs, we study its specification under that framework. We also study possible specifications beyond GLMs, where we use the negative binomial, quasi-binomial, and quasi-Poisson models as examples. In linear regression for Gaussian data, the Wald-based spatial scan statistic can be as powerful as the likelihood ratio-based spatial scan statistic. The main reason is that the Wald test is

Simulation and case study

We evaluated properties of our approach via simulation and case studies. Both were carried out based on the spatial template provided in the Jiangxi infant mortality data. Jiangxi is an eastern province of China, which has 99 counties. The data contained the county-level infant birth and death counts obtained from the 2000 Census of China. In 2000, the province had the highest infant mortality rate (IMR) in eastern China. The data set was previously studied with a likelihood ratio-based spatial

Discussion

We have proposed an approach to extend the likelihood ratio-based spatial scan statistic to the Wald-based spatial scan statistic. A nice feature is that the Wald-based spatial scan approach can still be used even if the derivation of the likelihood ratio statistic is difficult. The exact distribution is not required in the Wald-based spatial scan approach. It is more flexible to combine with existing well-known models for spatial clusters as opposed to the relatively rigid likelihood ratio-based

Acknowledgments

The authors appreciate the comments from an associate editor and two anonymous referees. These comments significantly improved the quality of the article. This work is supported by “the Fundamental Research Funds for the Central Universities” in UIBE (CXTD9-07) of the corresponding author Ying Liu.

References (24)

  • Zhang T. et al. Scan statistics in loglinear models. Comput. Statist. Data Anal. (2009)
  • Zhang T. et al. On the limiting distribution of the spatial scan statistic. J. Multivariate Anal. (2013)
  • Agresti A. Categorical Data Analysis (2002)
  • Assuncao R. et al. Fast detection of arbitrarily shaped disease clusters. Stat. Med. (2006)
  • Bhatt V. et al. A spatial scan statistic for survival data based on Weibull distribution. Stat. Med. (2014)
  • Cançado A. et al. A zero-inflated Poisson-based spatial scan statistic. Environ. Ecol. Stat. (2014)
  • de Lima M.S. et al. Spatial scan statistics for models with overdispersion and inflated zeros. Statist. Sinica (2015)
  • de Lima M.S. et al. ScanZID: Spatial scan statistics with zero inflation and dispersion
  • Gangnon R.E. et al. A hierarchical model for spatial cluster disease rates. Stat. Med. (2003)
  • Green P.J. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B (1984)
  • Huang L. et al. A spatial scan statistic for survival data. Biometrics (2007)
  • Huang L. et al. Weighted normal spatial scan statistic for heterogeneous population data. J. Amer. Statist. Assoc. (2009)