Power and sample size calculations for Poisson and zero-inflated Poisson regression models

https://doi.org/10.1016/j.csda.2013.09.029

Abstract

Although sample size calculations for testing a parameter in the Poisson regression model have been previously done, very little attention has been given to the effect of the correlation structure of the explanatory covariates on the sample size. A method to calculate the sample size for the Wald test in the Poisson regression model is proposed, assuming that the covariates may be correlated and have a multivariate normal distribution. Although this method of calculation works with any pre-specified correlation structure, the exchangeable and the AR(1) correlation matrices with different values for the correlation are used to illustrate the approach. The method used here to calculate the sample size is based on a modification of a methodology already proposed in the literature. Rather than using a discrete approximation to the normal distribution which may be much more problematic in higher dimensions, Monte Carlo simulations are used. It is observed that the sample size depends on the number of covariates for the exchangeable correlation matrix, but much more so on the correlation structure of the covariates. The sample size for the AR(1) correlation matrix changes less substantially as the dimension increases, and it also depends on the correlation structure of the covariates, but to a much lesser extent. The methodology is also extended to the case of the zero-inflated Poisson regression model in order to obtain analogous results.

Introduction

For scientific studies in diverse fields, calculation of the sample size needed for a test with pre-specified power and size is usually a necessary part of the design. This is especially true in biomedical research, where Poisson regression models are widely used. It can often happen that the observed count data may have many zeros and that the Poisson model may not be an adequate model for the counts in such situations. The more complex zero-inflated Poisson (ZIP) model could be a better choice here. In general, when the model involves multiple parameters, the inference becomes more complex, even more so for the zero-inflated Poisson models, when additional covariates are considered for the excess zero process.

Sample size determination and power calculations for generalized linear models are usually based on the Wald test (Whittemore, 1981, Signorini, 1991, Shieh, 2001, Shieh, 2005), the score test (Self and Mauritsen, 1988) or the likelihood ratio test (Self et al., 1992, Shieh, 2000). However, Matsui (2005) used a nonparametric test for sample size determination and Lyles et al. (2007) used a method of discrete approximation, originally proposed by Blom (1958) for transformed beta variables, in order to do power calculations but they did not include sample size calculations. Many of these authors concentrated on the univariate case. However, in most applications there are several explanatory covariates that are usually correlated. We propose to study the effect of this correlation structure of those covariates on the power and sample size calculations. It should be noted that we are only assuming that the covariates are correlated but that the responses are independent.

In our calculations, we propose a more flexible method based on Monte Carlo simulations. When the covariates are continuous random variables, or discrete random variables with infinite support, we do not need the type of discrete approximation used in Shieh (2001) or Lyles et al. (2007). For example, these authors used 10 or 11 points to approximate the Poisson distribution, although Lyles et al. (2007) increased this number by adding pseudo-observations at each of the points.
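To make the Monte Carlo idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a log link, covariates X ~ N(0, V), and the simplified textbook Wald formula n = (z_{1-alpha/2} + z_{power})^2 [I^{-1}]_{11} / beta_1^2, in which the per-observation Fisher information I = E[exp(beta'Z) Z Z^T], with Z = (1, X), is estimated by simulation rather than by a discrete approximation. The method of Shieh (2001) and the modification proposed in the paper may differ from this formula. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def mc_information(beta, cov, draws, rng):
    """Monte Carlo estimate of E[exp(beta'Z) Z Z^T], Z = (1, X), X ~ N(0, cov)."""
    d = len(beta)  # intercept plus p covariates
    low = cholesky(cov)
    info = [[0.0] * d for _ in range(d)]
    for _ in range(draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
        x = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(d - 1)]
        z = [1.0] + x
        w = math.exp(sum(b * v for b, v in zip(beta, z)))  # Poisson mean at this draw
        for i in range(d):
            for j in range(d):
                info[i][j] += w * z[i] * z[j] / draws
    return info


def wald_sample_size(beta, cov, alpha=0.05, power=0.9, draws=20000, seed=1):
    """Simplified Wald sample size for testing beta[1] = 0 (assumed formula, see text)."""
    info = mc_information(beta, cov, draws, random.Random(seed))
    e1 = [0.0] * len(beta)
    e1[1] = 1.0
    var1 = solve(info, e1)[1]  # the beta_1 diagonal entry of the inverse information
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z * z * var1 / beta[1] ** 2)


if __name__ == "__main__":
    # Three exchangeable covariates with correlation 0.5 (illustrative values only).
    v = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
    print(wald_sample_size([0.0, 0.3, 0.2, 0.2], v))
```

With a single standard normal covariate the information matrix has a closed form, so the inverse entry should be close to exp(-beta_0 - beta_1^2 / 2); this gives a quick sanity check on the simulation before moving to higher dimensions.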

Although Lambert (1992), Hall (2000) and Yau and Lee (2001) present complete and detailed descriptions of the zero-inflated Poisson (ZIP) regression model and its appropriate estimation procedures, there does not seem to be any previous research on sample size calculations for this model. However, Williamson et al. (2007) used the approximations of Blom (1958) to perform power calculations in this case.

We have chosen to study the Wald test here because it is frequently used due to its accessibility and its intuitiveness and because of several major recent contributions in sample size calculations for this test. For the logistic regression model, Whittemore (1981) approximated the Fisher information matrix in order to do sample size calculations. Signorini (1991) extended her technique to the Poisson regression model with one covariate in order to perform sample size calculations for the Wald test. The method presented by Shieh (2001) is a modification and a generalization of the methods of Signorini (1991) for testing one parameter in a Poisson regression model.

We consider here the case of using the Wald test to test whether one parameter is zero in a Poisson regression model with more than one covariate, and we introduce our modification to the method of Shieh (2001) for the corresponding sample size calculations. We then extend the methodology to the zero-inflated Poisson regression model. In particular, we study the influence of the correlation structure of the explanatory covariates on the sample size using the two most popular types of correlation matrices. We also study the difference in sample size, not only as the correlation changes but also as the number of additional covariates increases.

This paper has two major contributions. We are able to modify the method of Shieh (2001) in order to study the effect of the correlation structure of the covariates in the higher dimensional case for the Poisson regression model by the use of Monte Carlo simulations and we present an extension of our methodology for sample size calculations to the ZIP regression model. We also studied the effect of the correlation between the covariates and the number of covariates on sample size calculations for both models. This latter contribution was motivated by two important features. First, although studies often have control variables, the impact on the sample size of having additional covariates is often neglected. Also, it is motivated by the importance of the correlation structure between covariates, especially in clinical trials, where the exchangeable and the AR correlation matrices are widely used. In some situations, the covariates may be equally correlated and the order has no importance, so the exchangeable structure can be considered for the correlation matrix. In other cases, when there is a time dependence and the order is important, the autoregressive structure can be more appropriate, and the matrix is built from an autoregressive process with a specific order.

The remainder of the paper is organized as follows. In the next section we introduce the Poisson regression model and describe our approach to sample size calculations for this model, with covariates having a multivariate normal distribution with different forms for the correlation matrix. It is important to emphasize that the method works with any type of correlation structure. In Section 3, we present the extension of our approach to the zero-inflated Poisson regression models. Section 4 contains numerical examples with different correlation structures for the covariates and with an accompanying discussion of these results. Section 5 contains an illustrative example. Finally, in Section 6, we draw some conclusions and discuss future directions of research related to this work.


Poisson regression model

The Poisson regression model is an important member of the family of generalized linear models (McCullagh and Nelder, 1983). In these models, the density of the response random variable $Y$ takes the form: $f_Y(y;\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\}$, where $a$, $b$ and $c$ are specified functions, $\theta$ is the canonical parameter, $\phi$ is the dispersion parameter and the linear predictor takes the following form: $g(\lambda)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$, where $g$ is the link function, $\lambda=E[Y]$, $X=(1,X_1,\ldots,X_p)^T$ is a vector of $(p+1)$ covariates

Zero-inflated Poisson model

It is well known that outcomes which consist of counts in biomedical research may have many zeros and that the Poisson model may not be an adequate model in such a situation. Several models have been introduced for zero-inflated data of this kind, among them the zero-inflated Poisson model (Lambert, 1992), the zero-inflated negative binomial model and zero-inflated binomial model (Hall, 2000), and the zero-inflated gamma model (Yau et al., 2002).
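As a brief illustration of what zero inflation means, the sketch below evaluates the standard ZIP probability mass function: with probability pi the count is a structural zero, and otherwise it is drawn from a Poisson distribution with mean lam. It is a minimal sketch with hypothetical function names; in the regression setting lam and pi would each depend on covariates, which is not modelled here.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson with Poisson mean lam and zero-inflation pi."""
    if y < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros come from two sources: the structural zeros and the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """E[Y] = (1 - pi) * lam, smaller than the Poisson mean whenever pi > 0."""
    return (1.0 - pi) * lam


if __name__ == "__main__":
    lam, pi = 2.0, 0.3  # illustrative values only
    print("P(Y=0) under Poisson:", math.exp(-lam))
    print("P(Y=0) under ZIP:    ", zip_pmf(0, lam, pi))
    print("ZIP mean:            ", zip_mean(lam, pi))
```

Comparing the two zero probabilities shows why a plain Poisson fit can be inadequate for such data: the same lam leaves far less mass at zero than the mixture does.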

We concentrate here on the zero-inflated Poisson regression

Simulation study

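In the study of the Poisson regression model, for the vector of covariates $X=(X_1,\ldots,X_p)^T$, we considered two special cases for the parameterization of the covariance matrix $V_l$, $l=1,2$, which depend only on a single parameter $\rho$. The first one represents the exchangeable case and the second is the AR(1) model. Each matrix is defined as follows:

$$V_1:\ \mathrm{Cov}(X)=\begin{pmatrix}1&\rho&\cdots&\rho\\ \rho&1&\cdots&\rho\\ \vdots&\vdots&\ddots&\vdots\\ \rho&\rho&\cdots&1\end{pmatrix},$$

$$V_2:\ \mathrm{Cov}(X)=\frac{1}{1-\rho^2}\begin{pmatrix}1&\rho&\cdots&\rho^{j-1}&\cdots&\rho^{p-1}\\ \rho&1&\cdots&\rho^{j-2}&\cdots&\rho^{p-2}\\ \vdots&\vdots&\ddots&\vdots&&\vdots\\ \rho^{j-1}&\rho^{j-2}&\cdots&1&\cdots&\rho^{p-j}\\ \vdots&\vdots&&\vdots&\ddots&\vdots\\ \rho^{p-1}&\rho^{p-2}&\cdots&\rho^{p-j}&\cdots&1\end{pmatrix}$$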
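The two structures can be built programmatically from the single parameter rho. The sketch below uses hypothetical function names and plain nested lists; the AR(1) version follows the form given in this section, in which entry (i, j) equals rho^|i-j| / (1 - rho^2), that is, the stationary covariance of an AR(1) process with unit innovation variance.

```python
def exchangeable_cov(p, rho):
    """V1: ones on the diagonal and rho in every off-diagonal position."""
    # Positive definiteness requires -1/(p-1) < rho < 1 when p > 1.
    if not rho < 1.0 or (p > 1 and not rho > -1.0 / (p - 1)):
        raise ValueError("rho outside the range giving a positive definite matrix")
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def ar1_cov(p, rho):
    """V2: entry (i, j) is rho**|i-j| / (1 - rho**2)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    scale = 1.0 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(p)] for i in range(p)]


if __name__ == "__main__":
    for name, mat in (("exchangeable", exchangeable_cov(4, 0.5)),
                      ("AR(1)", ar1_cov(4, 0.5))):
        print(name)
        for row in mat:
            print("  ", [round(v, 4) for v in row])
```

Printing both matrices for the same rho makes the contrast visible: dependence stays constant across all pairs in the exchangeable case, whereas it decays geometrically with the distance |i-j| in the AR(1) case.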

Illustrative example

In order to illustrate our methodology with a real example, we reconsider here an application first presented in Signorini (1991). During a study of water pollution around Sydney, Australia, the number of illnesses and infections contracted during a swimming season was examined by the Sydney Water Board. The objective was to determine if there was a significant difference between non-ocean or infrequent swimmers (X1=0) and ocean swimmers (X1=1). Using a Poisson regression to model the number of

Conclusion

Our objective here was to study the effect of the correlation structure of the covariates and the number of covariates on the sample size required to attain certain levels of power and size for the Wald test when testing whether one parameter is zero in a multidimensional Poisson regression model and the zero-inflated Poisson regression model. We introduced Monte Carlo simulation techniques in order to adapt the approach of Shieh (2001) to do sample size calculations for the Poisson regression

Acknowledgments

The first author was supported by a GERAD postdoctoral fellowship and by the NSERC Discovery Grants of the second and third authors. All three authors would like to thank Claude Gravel for some of the preliminary calculations for this research. They would also like to thank the referees for their helpful comments which improved the original manuscript.

References (20)

  • G. Shieh. On power and sample size calculations for Wald tests in generalized linear models. Journal of Statistical Planning and Inference (2005)
  • W. Tu et al. Empirical Bayes analysis for a hierarchical Poisson generalized linear model. Journal of Statistical Planning and Inference (2003)
  • G. Blom. Statistical Estimates and Transformed Beta-Variables (1958)
  • C.L. Christiansen et al. Fitting and checking a two-level Poisson model: modeling patient mortality rates in heart transplant patients
  • G.H. Golub et al. Matrix Computations (1989)
  • D.B. Hall. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics (2000)
  • D. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics (1992)
  • R.H. Lyles et al. A practical approach to computing power for generalized linear models with nominal, count, or ordinal responses. Statistics in Medicine (2007)
  • S. Matsui. Sample size calculations for comparative clinical trials with over-dispersed Poisson process data. Statistics in Medicine (2005)
  • P. McCullagh et al. Generalized Linear Models (1983)
There are more references available in the full text version of this article.
