Power and sample size calculations for Poisson and zero-inflated Poisson regression models

doi:10.1016/j.csda.2013.09.029

Computational Statistics & Data Analysis

Volume 72, April 2014, Pages 241-251

https://doi.org/10.1016/j.csda.2013.09.029

Abstract

Although sample size calculations for testing a parameter in the Poisson regression model have been previously done, very little attention has been given to the effect of the correlation structure of the explanatory covariates on the sample size. A method to calculate the sample size for the Wald test in the Poisson regression model is proposed, assuming that the covariates may be correlated and have a multivariate normal distribution. Although this method of calculation works with any pre-specified correlation structure, the exchangeable and the AR(1) correlation matrices with different values for the correlation are used to illustrate the approach. The method used here to calculate the sample size is based on a modification of a methodology already proposed in the literature. Rather than using a discrete approximation to the normal distribution which may be much more problematic in higher dimensions, Monte Carlo simulations are used. It is observed that the sample size depends on the number of covariates for the exchangeable correlation matrix, but much more so on the correlation structure of the covariates. The sample size for the AR(1) correlation matrix changes less substantially as the dimension increases, and it also depends on the correlation structure of the covariates, but to a much lesser extent. The methodology is also extended to the case of the zero-inflated Poisson regression model in order to obtain analogous results.

Introduction

For scientific studies in diverse fields, calculation of the sample size needed for a test with pre-specified power and size is usually a necessary part of the design. This is especially true in biomedical research, where Poisson regression models are widely used. Observed count data often contain many zeros, in which case the Poisson model may not be adequate and the more complex zero-inflated Poisson (ZIP) model could be a better choice. In general, when the model involves multiple parameters, inference becomes more complex, and even more so for zero-inflated Poisson models when additional covariates are considered for the excess zero process.

Sample size determination and power calculations for generalized linear models are usually based on the Wald test (Whittemore, 1981; Signorini, 1991; Shieh, 2001, 2005), the score test (Self and Mauritsen, 1988) or the likelihood ratio test (Self et al., 1992; Shieh, 2000). However, Matsui (2005) used a nonparametric test for sample size determination, and Lyles et al. (2007) used a method of discrete approximation, originally proposed by Blom (1958) for transformed beta variables, to do power calculations, but they did not include sample size calculations. Many of these authors concentrated on the univariate case. However, in most applications there are several explanatory covariates that are usually correlated. We propose to study the effect of the correlation structure of these covariates on the power and sample size calculations. Note that we assume only that the covariates are correlated; the responses are assumed to be independent.

In our calculations, we propose a more flexible method based on Monte Carlo simulations. We do not need the type of discrete approximation used in Shieh (2001) or Lyles et al. (2007) when considering continuous random variables or discrete random variables with infinite support for the covariates. For example, these authors used 10 or 11 points to approximate the Poisson distribution, although Lyles et al. (2007) increased this number by adding pseudo-observations at each of the points.

Although Lambert (1992), Hall (2000) and Yau and Lee (2001) present complete and detailed descriptions of the zero-inflated Poisson (ZIP) regression model and its appropriate estimation procedures, there does not seem to be any previous research on sample size calculations for this model. However, Williamson et al. (2007) used the approximations of Blom (1958) to perform power calculations in this case.

We have chosen to study the Wald test here because it is frequently used due to its accessibility and its intuitiveness and because of several major recent contributions in sample size calculations for this test. For the logistic regression model, Whittemore (1981) approximated the Fisher information matrix in order to do sample size calculations. Signorini (1991) extended her technique to the Poisson regression model with one covariate in order to perform sample size calculations for the Wald test. The method presented by Shieh (2001) is a modification and a generalization of the methods of Signorini (1991) for testing one parameter in a Poisson regression model.

We consider here the Wald test of whether one parameter is zero in a Poisson regression model with more than one covariate, and we introduce our modification to the method of Shieh (2001) for the corresponding sample size calculations. We then extend the methodology to the zero-inflated Poisson regression model. In particular, we study the influence of the correlation structure of the explanatory covariates on the sample size using the two most popular types of correlation matrices. We also study the difference in sample size, not only as the correlation changes but also as the number of additional covariates increases.
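The idea of replacing a discrete approximation by Monte Carlo draws can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' modification of Shieh (2001): it assumes zero-mean multivariate normal covariates, a log link, and a simplified Wald formula that uses the variance under the alternative only, $n = (z_{1-\alpha/2} + z_{power})^{2} \, [I(\beta)^{-1}]_{11} / \beta_{1}^{2}$, where $I(\beta) = E[e^{\beta^{T} x} x x^{T}]$ is the per-observation Fisher information estimated by simulation. All function names, default values and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(V):
    """Lower-triangular L with L L^T = V (V symmetric positive definite)."""
    p = len(V)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(V[i][i] - s)
            else:
                L[i][j] = (V[i][j] - s) / L[j][j]
    return L


def invert(A):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def mc_information(beta, V, draws, rng):
    """Monte Carlo estimate of E[exp(beta' x) x x^T] with x = (1, X), X ~ N(0, V)."""
    p = len(V)
    L = cholesky(V)
    info = [[0.0] * (p + 1) for _ in range(p + 1)]
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = [1.0] + [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        lam = math.exp(sum(b * xi for b, xi in zip(beta, x)))
        for i in range(p + 1):
            for j in range(p + 1):
                info[i][j] += lam * x[i] * x[j] / draws
    return info


def wald_sample_size(beta, V, alpha=0.05, power=0.9, draws=20000, seed=1):
    """Approximate n for a two-sided Wald test of H0: beta[1] = 0 (simplified formula)."""
    rng = random.Random(seed)
    var_b1 = invert(mc_information(beta, V, draws, rng))[1][1]
    z = NormalDist().inv_cdf
    n = (z(1.0 - alpha / 2.0) + z(power)) ** 2 * var_b1 / beta[1] ** 2
    return math.ceil(n)


# Hypothetical inputs: intercept 0.5, tested slope 0.3, one nuisance covariate.
print(wald_sample_size([0.5, 0.3, 0.0], [[1.0, 0.5], [0.5, 1.0]]))
```

Calling `wald_sample_size` with the same `beta` but different covariance matrices `V` gives a quick feel for how sensitive the required sample size is to the correlation between the tested covariate and the others.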

This paper has two major contributions. We are able to modify the method of Shieh (2001) in order to study the effect of the correlation structure of the covariates in the higher dimensional case for the Poisson regression model by the use of Monte Carlo simulations, and we present an extension of our methodology for sample size calculations to the ZIP regression model. We also studied the effect of the correlation between the covariates and the number of covariates on sample size calculations for both models. This latter contribution was motivated by two important features. First, although studies often have control variables, the impact on the sample size of having additional covariates is often neglected. Second, it is motivated by the importance of the correlation structure between covariates, especially in clinical trials, where the exchangeable and the AR correlation matrices are widely used. In some situations, the covariates may be equally correlated and the order has no importance, so the exchangeable structure can be considered for the correlation matrix. In other cases, when there is a time dependence and the order is important, the autoregressive structure can be more appropriate, and the matrix is built from an autoregressive process with a specific order.

The remainder of the paper is organized as follows. In the next section we introduce the Poisson regression model and describe our approach to sample size calculations for this model, with covariates having a multivariate normal distribution with different forms for the correlation matrix. It is important to emphasize that the method works with any type of correlation structure. In Section 3, we present the extension of our approach to the zero-inflated Poisson regression models. Section 4 contains numerical examples with different correlation structures for the covariates and with an accompanying discussion of these results. Section 5 contains an illustrative example. Finally, in Section 6, we draw some conclusions and discuss future directions of research related to this work.

Section snippets

Poisson regression model

The Poisson regression model is an important member of the family of generalized linear models (McCullagh and Nelder, 1983). In these models, the density of the response random variable $Y$ takes the form: $f_{Y}(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\},$ where $a$, $b$ and $c$ are specified functions, $\theta$ is the canonical parameter, $\phi$ is the dispersion parameter and the linear predictor takes the following form: $g(\lambda) = \beta_{0} + \beta_{1} X_{1} + \dots + \beta_{p} X_{p},$ where $g$ is the link function, $\lambda = E[Y]$, $X = (1, X_{1}, \dots, X_{p})^{T}$ is a vector of $(p + 1)$ covariates
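For the Poisson case, the general exponential-family density specializes in a standard way. The short derivation below is a textbook identity added for orientation, not a passage from the paper; it assumes the canonical log link.

```latex
\begin{align*}
f_Y(y;\lambda) &= \frac{e^{-\lambda}\lambda^{y}}{y!}
               = \exp\{\, y\log\lambda - \lambda - \log y! \,\}, \\
\theta &= \log\lambda, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\log y!, \\
E[Y] &= b'(\theta) = \lambda, \qquad \operatorname{Var}(Y) = a(\phi)\, b''(\theta) = \lambda, \\
g(\lambda) &= \log\lambda
\;\Longrightarrow\;
\lambda = \exp(\beta_{0} + \beta_{1}X_{1} + \dots + \beta_{p}X_{p}).
\end{align*}
```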

Zero-inflated Poisson model

It is well known that outcomes which consist of counts in biomedical research may have many zeros and the Poisson model may not be an adequate model in such a situation. Several models have been introduced for such zero-inflated data, including the zero-inflated Poisson model (Lambert, 1992), the zero-inflated negative binomial model and zero-inflated binomial model (Hall, 2000), and the zero-inflated gamma model (Yau et al., 2002).
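As a reminder of what zero inflation means, the sketch below codes the ZIP distribution of Lambert (1992) as a mixture of a point mass at zero (probability `pi`) and a Poisson distribution with mean `lam`. The function names and the numerical values are hypothetical; in the regression setting, `lam` and `pi` would each depend on covariates (Lambert uses a log link and a logit link, respectively).

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson: excess-zero probability pi, Poisson mean lam."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return pi * (y == 0) + (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """E[Y] = (1 - pi) * lam."""
    return (1.0 - pi) * lam


def zip_variance(lam, pi):
    """Var(Y) = (1 - pi) * lam * (1 + pi * lam), which exceeds the mean when pi > 0."""
    return (1.0 - pi) * lam * (1.0 + pi * lam)


# Hypothetical values: 30% excess zeros on top of a Poisson(2) count.
lam, pi = 2.0, 0.3
print(zip_pmf(0, lam, pi), math.exp(-lam))  # P(Y = 0) under ZIP vs. plain Poisson
print(zip_mean(lam, pi), zip_variance(lam, pi))
```

The variance formula makes the overdispersion explicit: for any `pi > 0` the variance is larger than the mean, which is why a plain Poisson model fits such data poorly.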

We concentrate here on the zero-inflated Poisson regression

Simulation study

In the study of the Poisson regression model, for the vector of covariates $X = (X_{1}, \dots, X_{p})^{T}$, we considered two special cases for the parameterization of the covariance matrix $V_{l}, l = 1, 2$, which depend only on a single parameter $\rho$. The first one represents the exchangeable case and the second is the AR(1) model. Each matrix is defined as follows: $V_{1} : \mathrm{Cov}(X) = \begin{pmatrix} 1 & \rho & \dots & \rho \\ \rho & 1 & \dots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \dots & 1 \end{pmatrix},$ $V_{2} : \mathrm{Cov}(X) = \frac{1}{1 - \rho^{2}} \begin{pmatrix} 1 & \rho & \dots & \rho^{j - 1} & \dots & \rho^{p - 1} \\ \rho & 1 & \dots & \rho^{j - 2} & \dots & \rho^{p - 2} \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ \rho^{j - 1} & \rho^{j - 2} & \dots & 1 & \dots & \rho^{p - j} \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ \rho^{p - 1} & \rho^{p - 2} & \dots & \rho^{p - j} & \dots & 1 \end{pmatrix}.$
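The two covariance structures are easy to generate programmatically. The following sketch builds them as nested lists in plain Python; the function names are hypothetical, and the range checks reflect the usual conditions for positive definiteness ($-1/(p-1) < \rho < 1$ for the exchangeable matrix, $|\rho| < 1$ for AR(1)).

```python
def exchangeable(p, rho):
    """p x p exchangeable matrix V_1: ones on the diagonal, rho elsewhere."""
    if p > 1 and not (-1.0 / (p - 1) < rho < 1.0):
        raise ValueError("need -1/(p-1) < rho < 1 for a positive definite matrix")
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def ar1(p, rho):
    """p x p AR(1) covariance V_2 with (i, j) entry rho^|i-j| / (1 - rho^2)."""
    if not (-1.0 < rho < 1.0):
        raise ValueError("need |rho| < 1 for a stationary AR(1) process")
    scale = 1.0 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(p)] for i in range(p)]


# Hypothetical example: three covariates with rho = 0.5.
for row in exchangeable(3, 0.5):
    print(row)
for row in ar1(3, 0.5):
    print(row)
```

In the exchangeable case every pair of covariates shares the same correlation regardless of $p$, whereas under AR(1) the dependence decays geometrically with the distance $|i - j|$ between covariate indices.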

Illustrative example

In order to illustrate our methodology with a real example, we reconsider here an application first presented in Signorini (1991). During a study of water pollution around Sydney, Australia, the number of illnesses and infections contracted during a swimming season was examined by the Sydney Water Board. The objective was to determine if there was a significant difference between non-ocean or infrequent swimmers $(X_{1} = 0)$ and ocean swimmers $(X_{1} = 1)$ . Using a Poisson regression to model the number of

Conclusion

Our objective here was to study the effect of the correlation structure of the covariates and the number of covariates on the sample size required to attain certain levels of power and size for the Wald test when testing whether one parameter is zero in a multidimensional Poisson regression model and the zero-inflated Poisson regression model. We introduced Monte Carlo simulation techniques in order to adapt the approach of Shieh (2001) to do sample size calculations for the Poisson regression

Acknowledgments

The first author was supported by a GERAD postdoctoral fellowship and by the NSERC Discovery Grants of the second and third authors. All three authors would like to thank Claude Gravel for some of the preliminary calculations for this research. They would also like to thank the referees for their helpful comments which improved the original manuscript.

References (20)

G. Shieh, On power and sample size calculations for Wald tests in generalized linear models, Journal of Statistical Planning and Inference (2005)
W. Tu et al., Empirical Bayes analysis for a hierarchical Poisson generalized linear model, Journal of Statistical Planning and Inference (2003)
G. Blom, Statistical Estimates and Transformed Beta-Variables (1958)
C.L. Christiansen et al., Fitting and checking a two-level Poisson model: modeling patient mortality rates in heart transplant patients
G.H. Golub et al., Matrix Computations (1989)
D.B. Hall, Zero-inflated Poisson and binomial regression with random effects: a case study, Biometrics (2000)
D. Lambert, Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics (1992)
R.H. Lyles et al., A practical approach to computing power for generalized linear models with nominal, count, or ordinal responses, Statistics in Medicine (2007)
S. Matsui, Sample size calculations for comparative clinical trials with over-dispersed Poisson process data, Statistics in Medicine (2005)
P. McCullagh et al., Generalized Linear Models (1983)

There are more references available in the full text version of this article.

Cited by (8)

Dysconnectivity between the anterior insula and the dorsal anterior cingulate cortex during an emotion go/nogo paradigm is associated with aggressive behaviors in male schizophrenia patients
2023, Psychiatry Research - Neuroimaging
Citation Excerpt :
As our patient sample exhibited a notable range of number of aggressive/violent behaviors (0–14) (Table 1) and given the distribution of this data did not follow a normal distribution, we evaluated the dimensional relationship between such behaviors and neural alterations using a negative binomial regression with log link model. For count variables that are over-dispersed, the negative binomial regression is preferred to the zero-inflated Poisson regression in studies involving relatively small samples (Channouf et al., 2014). We coded the total number of violent acts (first 5 questions of the MACVI) and total number of other aggressive acts (subsequent 17 questions of the MACVI) as dependent variables in 2 separate regression models.
This study aimed to investigate the association between past-reported violent/aggressive behaviors and brain functional connectivity in male patients suffering from schizophrenia using a task modeling the interaction between negative emotion processing and response inhibition. Forty-four male patients with schizophrenia and twenty-two healthy male controls performed an emotional go/no-go task using angry and neutral faces during a functional magnetic resonance imaging session. Generalized psycho-physiological interaction was conducted to explore task-based functional connectivity and a negative binomial regression was used to evaluate the relationship between neural alterations and violent/aggressive behaviors. Regions involved in response inhibition and emotion regulation, such as the anterior insula, dorsal anterior cingulate cortex (dACC) and dorsolateral prefrontal cortex (DLPFC), were used as seed regions. During emotion-related response inhibition, patients with schizophrenia displayed altered connectivity between the anterior insula and amygdala, the DLPFC and lateral orbitofrontal cortex (OFC), as well as the anterior insula and the dACC when compared to healthy individuals. The latter was negatively associated with aggressive behaviors in participants with schizophrenia (Wald χ² = 9.51; p < 0.05, p-FDR corrected). Our results highlight alterations in functional connectivity in brain regions involved in cognitive control and emotion processing which are associated with aggressive behaviors in schizophrenia.
I think therefore I Am? Examining the relationship between exercise identity and exercise behavior during behavioral weight loss treatment
2019, Psychology of Sport and Exercise
Citation Excerpt :
A power analysis for a repeated measures ANOVA assessing change in EI from baseline to month 6 (objective 3) revealed a necessary sample size of 120 participants to have 80% power to detect a medium effect (West et al., 2011) given a moderate-to-large correlation between EI scores (Cardinal & Cardinal, 1997) and a type I error rate of 0.05. Based on published tables for determining sample size in zero-inflated Poisson models (Channouf, Fredette, & MacGibbon, 2014; Williamson, Lin, Lyles, & Hightower, 2007), the present study was adequately powered for models predicting MVPA (objectives 2 and 5). Thus, the study may have been underpowered to detect small correlations but was otherwise sufficiently powered.
Identification as an exerciser may promote physical activity. This study examined exercise identity (EI) and its relationship with demographic characteristics and exercise among adults participating in behavioral weight loss treatment, which is a key target population for increasing exercise.
Longitudinal.
Participants (N = 320) completed a measure of EI and exercise was assessed with accelerometers at baseline and 6 months.
Baseline EI and exercise were positively related and EI and exercise increased over time. However, change in EI was not meaningfully related to change in exercise, baseline EI did not predict change in exercise, and 6-month EI was not related to 6-month exercise. Participants identifying as non-White reported greater EI but lower exercise.
Although EI and exercise may increase among weight loss participants, the two may not be meaningfully related during active weight loss treatment. The relationship between EI and exercise may also differ based on race.
Sample size calculations for clustered count data based on zero-inflated discrete Weibull regression models
2024, Communications for Statistical Applications and Methods
Sample size for clustered count data based on discrete Weibull regression model
2023, Communications in Statistics: Simulation and Computation
Sample size calculation for cluster randomized trials with zero-inflated count outcomes
2022, Statistics in Medicine
Sample size calculation based on discrete Weibull and zero-inflated discrete Weibull regression models
2022, Communications in Statistics: Simulation and Computation
