A score test for zero-inflation in multilevel count data
Introduction
Zero-inflated count data occur in many fields such as public health, epidemiology, medicine, sociology, engineering and agriculture. The zero-inflated family of models was developed and extended with a variety of examples from different disciplines (Mullahy, 1986; Lambert, 1992; Greene, 1994; Böhning, 1998). Further applications of these models have also been reported (Böhning et al., 1999; Yau and Lee, 2001; Cheung, 2002; Lee et al., 2004). Hall (2000) considered clustered data and showed how mixed effect versions of zero-inflated Poisson (ZIP) and zero-inflated binomial (ZIB) models are appropriate. Subsequently, ZIP regression models with cluster-specific random effects were considered (Wang et al., 2002; Hur et al., 2002).
Van den Broek (1995) proposed a score test for zero-inflation in a Poisson distribution. This was extended to the special case of ZIP and ZIB models with covariates (Deng and Paul, 2000; Jansakul and Hinde, 2002) and, more recently, Xiang et al. (2006) proposed a score test for zero-inflation in correlated count data.
In practice, the non-zero part of the count data is often over-dispersed, and another distribution such as the zero-inflated negative binomial (ZINB) may be more appropriate than the ZIP. In this situation, there are tests for assessing the ZINB model against the ZIP model (Ridout et al., 2001; Xiang et al., 2007).
Often, because of a hierarchical study design or data collection procedure, data show intraclass correlation at several levels simultaneously. This is inherent in health surveys, where subjects are typically nested in clusters, regions or provinces. To model multilevel clustered count data, Lee et al. (2006) extended the ZIP with random effects to multilevel ZIP regression, and Moghimbeigi et al. (2008) extended the ZINB with random effects to multilevel ZINB regression.
The aim of this paper is to present a test for assessing zero-inflation in multilevel (two-fold correlated) count data. A brief review of the standard ZIP model is presented and then a multilevel ZIP regression incorporating random effects to account for data dependency and over-dispersion is used. The sampling distribution and power of the score test statistics are evaluated by a simulation study. The study is motivated by the analysis of the DMFT index of children aged 7–8 years. The DMFT index is the sum of the decayed, missing and filled surfaces of the permanent and primary teeth (World Health Organization, 1997).
Multilevel ZIP model
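Let $Y$ be the response count variable. The ZIP distribution can be written as: $P(Y = 0) = \omega + (1 - \omega)e^{-\lambda}$ and $P(Y = y) = (1 - \omega)e^{-\lambda}\lambda^{y}/y!$ for $y = 1, 2, \ldots$, where $\lambda$ is the parameter of the Poisson distribution and $0 \le \omega < 1$. Thus, it incorporates more zeros than those permitted by the Poisson distribution ($\omega = 0$); while $\omega < 0$ corresponds to the zero-deflated situation (Dietz and Böhning, 2000). This distribution has: $E(Y) = (1 - \omega)\lambda$ and $\mathrm{Var}(Y) = (1 - \omega)\lambda(1 + \omega\lambda)$. Consider a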
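As a rough illustration of the standard ZIP distribution reviewed above, the following Python sketch evaluates its probability mass function and checks the textbook mean and variance numerically. The function names, the symbols `lam` (Poisson parameter) and `omega` (zero-inflation proportion), and the parameter values are assumptions made for this example only; they are not taken from the paper.

```python
import math


def zip_pmf(y, lam, omega):
    """P(Y = y) for a zero-inflated Poisson with rate lam and zero-inflation omega."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros come from the inflation component and from the Poisson component.
        return omega + (1.0 - omega) * poisson
    return (1.0 - omega) * poisson


def zip_mean(lam, omega):
    """Closed-form mean of the ZIP distribution."""
    return (1.0 - omega) * lam


def zip_var(lam, omega):
    """Closed-form variance of the ZIP distribution."""
    return (1.0 - omega) * lam * (1.0 + omega * lam)


# Illustrative (hypothetical) parameter values.
lam, omega = 2.0, 0.3
support = range(60)  # truncation point; the tail mass beyond it is negligible here

total = sum(zip_pmf(y, lam, omega) for y in support)
mean = sum(y * zip_pmf(y, lam, omega) for y in support)
var = sum((y - mean) ** 2 * zip_pmf(y, lam, omega) for y in support)

print(total, mean, var)
print(zip_mean(lam, omega), zip_var(lam, omega))
```

Setting `omega = 0` recovers the ordinary Poisson distribution, which is the null model that a test for zero-inflation compares against.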
The score test for zero-inflation
The score test for comparing a multilevel ZIP regression against a multilevel Poisson regression is described as follows. Let , for . Testing is equivalent to testing . The conditional distribution of can be written as: Let and . Taking the first and second derivatives of with respect to
Sampling distribution
Eq. (5) exhibits the quadratic form that is expected from standard statistical theory (Cox and Hinkley, 1979). Under the null hypothesis, the score statistic has an approximate $\chi^2_1$ distribution. With both random-effect variance components equal to zero, it reduces to the ordinary score test of Van den Broek (1995), and with only one of them equal to zero it reduces to the score test for correlated count data of Xiang et al. (2006). To investigate the distribution of the score test statistic for small sample sizes, a simulation study is conducted under the null hypothesis. The
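To make the idea of a null-hypothesis simulation concrete, here is a minimal Python sketch for the simplest special case only: the ordinary score test of Van den Broek (1995) for independent counts with no covariates and no random effects. It is not the multilevel statistic of this paper. The function names, the sample size, the Poisson mean and the number of replications are all hypothetical choices, and the critical value 3.841 is the usual 5% point of a chi-squared distribution with one degree of freedom.

```python
import math
import random


def score_statistic(counts):
    """Score statistic for zero-inflation in an i.i.d. Poisson sample (no covariates)."""
    n = len(counts)
    ybar = sum(counts) / n  # Poisson MLE of the mean under the null
    if ybar == 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    p0 = math.exp(-ybar)  # fitted probability of a zero under the null
    n0 = sum(1 for y in counts if y == 0)  # observed number of zeros
    numerator = (n0 - n * p0) ** 2
    denominator = n * p0 * (1.0 - p0) - n * ybar * p0 ** 2
    return numerator / denominator


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def null_rejection_rate(n, lam, reps, seed=0, critical=3.841):
    """Share of simulated Poisson samples in which the test rejects at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [poisson_draw(lam, rng) for _ in range(n)]
        if score_statistic(sample) > critical:
            rejections += 1
    return rejections / reps


# Hypothetical settings: 300 replications of samples of size 200 with mean 2.
print(null_rejection_rate(n=200, lam=2.0, reps=300))
```

If the chi-squared approximation is adequate for the chosen sample size, the printed rejection rate should be close to the nominal 5% level. Repeating the run with smaller `n` gives a feel for how the approximation behaves in small samples.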
Application
In dental epidemiology, the DMFT index is an important and well-known indicator and overall measure of a person's dental status. It is a count of the number of Decayed, Missing, and Filled Teeth. The data in this study are part of a national health survey conducted in Iran in 1991. In this survey, clusters (regions) of unequal sizes were selected within provinces, and the health of residents (including dental health) was then measured within clusters. Because of the
Discussion
In many situations, count data have a large proportion of excess zeros and the zero-inflated Poisson may be appropriate. In this context there is a simple score test, due to Van den Broek (1995), for comparing the ZIP model with a constant proportion of excess zeros to a standard Poisson regression model. This test was extended to the general situation where the zero probability is allowed to depend on covariates (Jansakul and Hinde, 2002), and to correlated count data (Xiang et al., 2006). In this
Acknowledgements
The authors would like to thank the Ministry of Health and Medical Education for the health survey data. Thanks are also due to two reviewers for helpful comments and suggestions.
References (24)
- Jansakul and Hinde, 2002. Score tests for zero-inflated Poisson models. Comput. Statist. Data Anal.
- Mullahy, 1986. Specification and testing of some modified count data models. J. Econometrics
- Wang et al., 2002. A zero-inflated Poisson mixed model to analyze diagnosis related groups with majority of same-day hospital stays. Comput. Methods Programs Biomed.
- Böhning, 1998. Zero-inflated Poisson models and C.A.MAN: A tutorial collection of evidence. Biometrical J.
- Böhning et al., 1999. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J. Roy. Statist. Soc. A
- Cheung, 2002. Zero-inflated models for regression analysis of count data: A study of growth and development. Stat. Med.
- Cox and Hinkley, 1979. Theoretical Statistics.
- Deng and Paul, 2000. Score tests for zero-inflation in generalized linear models. Canad. J. Statist.
- Dietz and Böhning, 2000. On estimation of the Poisson parameter in zero-modified Poisson models. Comput. Statist. Data Anal.
- Greene, W.H., 1994. Accounting for excess zeros and sample selection in Poisson and negative binomial regression...
- Hall, 2000. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics
- Hur et al., 2002. Modeling clustered count data with excess zeros in health care outcomes research. Health Serv. Outcomes Res. Method