A score test for zero-inflation in multilevel count data

https://doi.org/10.1016/j.csda.2008.10.041

Abstract

Zero-inflated Poisson (ZIP) regression is appropriate in many situations for analyzing multilevel correlated count data with excess zeros. In this paper, a score test for assessing ZIP regression against Poisson regression in multilevel count data with excess zeros is developed. The sampling distribution and power of the score test statistic are evaluated using a simulation study. The results show that under a wide range of conditions, the score statistic performs satisfactorily. Finally, the use of the score test is illustrated on DMFT index data of children 7–8 years old.

Introduction

Zero-inflated count data occur in many fields such as public health, epidemiology, medicine, sociology, engineering and agriculture. The zero-inflated family of models was developed and extended with a variety of examples from different disciplines (Mullahy, 1986; Lambert, 1992; Greene, 1994; Böhning, 1998). Additionally, there are further applications of this method (Böhning et al., 1999; Yau and Lee, 2001; Cheung, 2002; Lee et al., 2004). Hall (2000) considered clustered data and showed how mixed effect versions of zero-inflated Poisson (ZIP) and zero-inflated binomial (ZIB) models are appropriate. Subsequently, ZIP regression models with cluster-specific random effects were considered (Wang et al., 2002; Hur et al., 2002).

Van den Broek (1995) proposed a score test for zero-inflation in a Poisson distribution. This was extended to the special case of ZIP and ZIB models with covariates (Deng and Paul, 2000; Jansakul and Hinde, 2002) and, more recently, Xiang et al. (2006) proposed a score test for zero-inflation in correlated count data.

In practice, the non-zero part of the count data is often over-dispersed and another distribution such as the zero-inflated negative binomial (ZINB) may be more appropriate than ZIP. For this situation, there are tests for assessing the ZINB model against the ZIP model (Ridout et al., 2001; Xiang et al., 2007).

Often, because of a hierarchical study design or data collection procedure, data show intraclass correlation at several levels simultaneously. This is inherent in health surveys, where subjects are typically nested in clusters, regions or provinces. To model multilevel clustered count data, Lee et al. (2006) extended the ZIP with random effects to multilevel ZIP regression, and Moghimbeigi et al. (2008) extended the ZINB with random effects to multilevel ZINB regression.

The aim of this paper is to present a test for assessing zero-inflation in multilevel (two-fold correlated) count data. A brief review of the standard ZIP model is presented and then a multilevel ZIP regression incorporating random effects to account for data dependency and over-dispersion is used. The sampling distribution and power of the score test statistics are evaluated by a simulation study. The study is motivated by the analysis of the DMFT index of children aged 7–8 years. The DMFT index is the sum of the decayed, missing and filled surfaces of the permanent and primary teeth (World Health Organization, 1997).

Multilevel ZIP model

Let $y$ be the response count variable. The ZIP distribution can be written as:

$$P(Y=0)=\phi+(1-\phi)\exp(-\exp(\eta)),$$

$$P(Y=y)=(1-\phi)\frac{\exp(-\exp(\eta))\exp(\eta)^{y}}{y!},\quad y=1,2,3,\ldots,$$

where $\exp(\eta)$ is the parameter of the Poisson distribution and $-\exp(-\exp(\eta))/(1-\exp(-\exp(\eta)))\le\phi\le 1$. Thus, it incorporates more zeros than those permitted by the Poisson distribution ($\phi=0$); while $\phi<0$ corresponds to the zero-deflated situation (Dietz and Böhning, 2000). This distribution has:

$$E(Y)=(1-\phi)\exp(\eta),\qquad \mathrm{Var}(Y)=(1-\phi)(1+\phi\exp(\eta))\exp(\eta).$$

Consider a
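As a quick numerical check of the ZIP probabilities and moments stated above, the following minimal sketch evaluates the probability mass function directly and compares the summed mean and variance with the closed forms. The function names `zip_pmf` and `zip_moments`, the truncation point `y_max`, and the parameter values are illustrative assumptions and are not taken from the paper.

```python
import math


def zip_pmf(y, eta, phi):
    """P(Y = y) for a ZIP variable with Poisson mean exp(eta) and weight phi."""
    lam = math.exp(eta)
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-lam + y * eta - math.lgamma(y + 1))
    if y == 0:
        return phi + (1.0 - phi) * poisson
    return (1.0 - phi) * poisson


def zip_moments(eta, phi, y_max=200):
    """Mean and variance by direct summation of the pmf, truncated at y_max."""
    probs = [zip_pmf(y, eta, phi) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


# Hypothetical parameter values chosen only for illustration.
eta, phi = math.log(2.0), 0.3
lam = math.exp(eta)
mean, var = zip_moments(eta, phi)
print(mean, (1 - phi) * lam)                    # summed mean vs closed form
print(var, (1 - phi) * (1 + phi * lam) * lam)   # summed variance vs closed form
```

Setting `phi` to its lower bound, `-exp(-lam) / (1 - exp(-lam))`, makes `zip_pmf(0, eta, phi)` vanish, which is the extreme zero-deflated case.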

The score test for zero-inflation

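The score test for comparing a multilevel ZIP regression against a multilevel Poisson regression is described as follows. Let $\gamma=\phi/(1-\phi)$, so that $-\exp(-\exp(\eta_{ijk}))<\gamma<\infty$ for $-\exp(-\exp(\eta_{ijk}))/(1-\exp(-\exp(\eta_{ijk})))\le\phi\le 1$. Testing $H_0:\phi=0$ is equivalent to testing $H_0:\gamma=0$. The conditional log-likelihood $\ell_1$ can be written as:

$$\ell_1=\sum_{ijk}\ell_{1ijk},$$

$$\ell_{1ijk}=-\log(1+\gamma)+I(y_{ijk}=0)\log\bigl(\gamma+\exp(-\exp(\eta_{ijk}))\bigr)+I(y_{ijk}>0)\bigl(\eta_{ijk}y_{ijk}-\exp(\eta_{ijk})-\log(y_{ijk}!)\bigr).$$

Let $\tau=\sigma_u^2$ and $\omega=\sigma_v^2$. Taking the first and second derivatives of the log-likelihood with respect to $\beta,u,v,\tau,\omega$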
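To make the reparameterization concrete, the sketch below codes the contribution of a single observation to the conditional log-likelihood in terms of γ, and the derivative of that contribution with respect to γ at γ = 0, which works out to I(y = 0)·exp(exp(η)) − 1. This treats the linear predictor η as given and ignores the random-effects part of the full derivation; the function names are illustrative assumptions, not the authors' notation.

```python
import math


def loglik_contrib(y, eta, gamma):
    """Log-likelihood contribution of one count y, given eta and gamma.

    Equals log P(Y = y) under the ZIP model with phi = gamma / (1 + gamma).
    """
    lam = math.exp(eta)
    out = -math.log1p(gamma)
    if y == 0:
        out += math.log(gamma + math.exp(-lam))
    else:
        out += y * eta - lam - math.lgamma(y + 1)
    return out


def score_gamma_at_null(y, eta):
    """Analytic derivative of loglik_contrib with respect to gamma at gamma = 0."""
    lam = math.exp(eta)
    return (math.exp(lam) if y == 0 else 0.0) - 1.0


# Compare the analytic score with a central finite difference (illustrative values).
eta, h = 0.5, 1e-6
for y in (0, 3):
    numeric = (loglik_contrib(y, eta, h) - loglik_contrib(y, eta, -h)) / (2 * h)
    print(y, score_gamma_at_null(y, eta), numeric)
```

Only zero counts contribute a positive term, so the summed score is large when the data contain more zeros than the fitted Poisson means would predict.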

Sampling distribution

Eq. (5) exhibits the quadratic form that is expected from standard statistical theory (Cox and Hinkley, 1979). Under $H_0:\gamma=0$, $S$ has an approximate $\chi_1^2$ distribution. With $u=0$ and $v=0$, $S$ reduces to the ordinary score test of Van den Broek (1995) and with only $u=0$ it reduces to the score test for correlated count data of Xiang et al. (2006). To investigate the distribution of the score test statistic $S$ for small sample sizes, a simulation study is conducted under the null hypothesis. The
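The idea of checking the null distribution by simulation can be illustrated on the simplest special case only: the ordinary score test of Van den Broek (1995) for independent counts with a common Poisson mean and no covariates or random effects, whose statistic is (n0 − n·p0)² / (n·p0·(1 − p0) − n·ȳ·p0²) with p0 = exp(−ȳ). The sketch below is not the multilevel statistic S of this paper and does not reproduce the paper's simulation design; the sample size, Poisson mean, number of replicates, seed and function names are all illustrative assumptions.

```python
import math
import random


def vdb_score_statistic(counts):
    """Score statistic for zero-inflation in an i.i.d. Poisson sample."""
    n = len(counts)
    ybar = sum(counts) / n
    if ybar == 0:
        raise ValueError("statistic is undefined when every count is zero")
    p0 = math.exp(-ybar)                      # fitted Poisson zero probability
    n0 = sum(1 for y in counts if y == 0)     # observed number of zeros
    return (n0 - n * p0) ** 2 / (n * p0 * (1.0 - p0) - n * ybar * p0 ** 2)


def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def null_rejection_rate(n=200, lam=2.0, reps=2000, seed=1):
    """Share of Poisson samples whose statistic exceeds the 5% chi-square(1) point."""
    rng = random.Random(seed)
    crit = 3.841458820694124  # upper 5% point of chi-square with 1 df
    rejections = 0
    for _ in range(reps):
        sample = [rpois(lam, rng) for _ in range(n)]
        if vdb_score_statistic(sample) > crit:
            rejections += 1
    return rejections / reps


print(null_rejection_rate())
```

If the chi-square approximation is adequate at this sample size, the printed rate should lie near the nominal 0.05 level; generating the samples from a ZIP distribution instead would give an empirical power estimate.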

Application

In dental epidemiology, the DMFT index is an important and well-known indicator of a person's overall dental status. It is a count of the number of Decayed, Missing, and Filled Teeth. The data in this study are part of a national health survey conducted in Iran in 1991. In this survey, clusters (regions) of unequal size were selected within provinces, and the health of residents (including dental health) was then measured within clusters. Because of the

Discussion

In many situations, count data contain a large proportion of excess zeros and the zero-inflated Poisson model may be appropriate. In this context, Van den Broek (1995) gave a simple score test for comparing the ZIP model with a constant proportion of excess zeros against a standard Poisson regression model. This test was extended to the general situation where the zero probability is allowed to depend on covariates (Jansakul and Hinde, 2002), and to correlated count data (Xiang et al., 2006). In this

Acknowledgements

The authors would like to thank the Ministry of Health and Medical Education for the health survey data. Thanks are also due to two reviewers for helpful comments and suggestions.

References (24)

  • N. Jansakul et al., Score tests for zero-inflated Poisson models, Comput. Statist. Data Anal. (2002)
  • J. Mullahy, Specification and testing of some modified count data models, J. Econometrics (1986)
  • K. Wang et al., A zero-inflated Poisson mixed model to analyze diagnosis related groups with majority of same-day hospital stays, Comput. Methods Programs Biomed. (2002)
  • D. Böhning, Zero-inflated Poisson models and C.A.MAN: A tutorial collection of evidence, Biometrical J. (1998)
  • D. Böhning et al., The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology, J. Roy. Statist. Soc. A (1999)
  • Y.B. Cheung, Zero-inflated models for regression analysis of count data: A study of growth and development, Stat. Med. (2002)
  • D.R. Cox et al., Theoretical Statistics (1979)
  • D. Deng et al., Score tests for zero-inflation in generalized linear models, Canad. J. Statist. (2000)
  • K. Dietz et al., On estimation of the Poisson parameter in zero-modified Poisson models, Comput. Statist. Data Anal. (2000)
  • W.H. Greene, Accounting for excess zeros and sample selection in Poisson and negative binomial regression... (1994)
  • D.B. Hall, Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics (2000)
  • K. Hur et al., Modeling clustered count data with excess zeros in health care outcomes research, Health Serv. Outcomes Res. Method (2002)