Bayesian beta regression for bounded responses with unknown supports

https://doi.org/10.1016/j.csda.2021.107345

Abstract

A new Bayesian regression framework is presented for the analysis of continuous response data with support restricted to an unknown finite interval. A four-parameter beta distribution is assumed for the response conditioning on covariates, with the mean or mode depending linearly on covariates through a known link function. An informative g-prior is proposed to incorporate the prior distribution for the marginal mean or mode of the response. Byproducts of the Markov chain Monte Carlo sampling for implementing the proposed method lead to model criteria useful for model selection. Goodness-of-fit of the model is assessed using Cox-Snell residual plots. The methodology is illustrated in simulations and demonstrated in two real-life data applications. An R package, betaBayes, is developed for easy implementation of the proposed regression methodology.

Introduction

Researchers in a wide range of fields encounter bounded data in their studies. For example, environmental scientists monitor the proportion of hygienic waste in residential solid waste. Asset allocations in a portfolio and the share of household income spent on food are bounded data of interest in economics. Psychologists analyze confidence ratings and bounded scores from cognitive tests administered to study subjects. Examples of bounded data in the biomedical field include prevalence rates and death rates of the coronavirus disease 2019 (COVID-19), and body fat percentages of athletes. Unlike for unbounded data, central tendency measures, skewness, and other features of the underlying distribution for bounded data are inextricable from the support of the distribution. Consequently, more caution is necessary when drawing inference for these features based on bounded data, especially when the support is unknown.

Existing approaches for analyzing bounded data typically assume a prefixed support such as (0,1), sometimes after scaling the raw data. The beta mean regression model proposed by Ferrari and Cribari-Neto (2004) has probably received the most attention for modeling response data bounded on the unit interval; in this model, the mean parameter of the beta distribution depends linearly on covariates through a known link function. Model diagnostic methods for beta mean regression were considered in Espinheira et al. (2008a, 2008b), Ferrari et al. (2011), and Rocha and Simas (2011). The model has also been extended to allow the precision parameter to vary with covariates (Smithson and Verkuilen, 2006; Ferrari et al., 2011). An R package, betareg (Cribari-Neto and Zeileis, 2010; Grün et al., 2012), is available on CRAN for fitting the beta mean regression model with varying precision and performing model diagnostics. This R package also allows for fitting a finite mixture of beta regression models (Verkuilen and Smithson, 2012). Time series analysis of bounded data via a beta mean regression is presented in Guolo et al. (2014), which incorporates serial dependence between responses via a Gaussian copula. All the aforementioned works carry out frequentist inference, mostly based on maximum likelihood. Bayesian treatments for modeling response data bounded on (0,1) include the Bayesian beta mean regression model (Branscum et al., 2007), a beta rectangular regression model based on a mixture of a beta distribution and a uniform distribution (Bayes et al., 2012), a mixed effects beta model (Figueroa-Zúñiga et al., 2013), and a flexible beta model based on a special mixture of two beta distributions (Migliorati et al., 2018). Unlike all the above regression models, which focus on inferring the conditional mean of a bounded response, Bayes et al. (2017) developed quantile regression models for bounded responses built upon beta distributions. Barrientos et al. (2017) proposed a fully nonparametric Bayesian approach to model the covariate-dependent distribution of a bounded response. Recently, Zhou et al. (2020) considered a beta mode regression model where the mode of the response is related to covariates through a link function.
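As a minimal sketch of the unit-interval beta mean regression described above, the following code maps a linear predictor to a conditional mean and then to the two beta shape parameters, using the mean-precision parameterization commonly associated with Ferrari and Cribari-Neto (2004). The logit link, the function names, and all numeric values are illustrative assumptions, not taken from the article.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def beta_shapes_from_mean(x, beta, phi):
    """Return (mu, a, b) for a beta mean regression on (0, 1).

    x    : covariate vector, including the leading 1 for the intercept
    beta : regression coefficients (hypothetical values in the usage below)
    phi  : precision parameter, phi > 0
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
    mu = inv_logit(eta)                            # conditional mean in (0, 1)
    a = mu * phi                                   # first shape parameter
    b = (1.0 - mu) * phi                           # second shape parameter
    return mu, a, b

# Hypothetical covariates, coefficients, and precision, for illustration only.
mu, a, b = beta_shapes_from_mean(x=[1.0, 0.5], beta=[0.2, -0.4], phi=10.0)
print(mu, a, b)
```

Under this parameterization the mean of the beta distribution equals a / (a + b) = mu, and larger values of phi concentrate the distribution more tightly around that mean.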

All existing works mentioned above assume that the response variable is bounded on a prefixed interval such as (0,1), which may not be appropriate. For example, a human being's body fat percentage can never reach a value close to zero or one. Google results show that the lowest body fat percentage in a human being is 2%; although the highest body fat percentage is not available, it is probably much less than one. In cases like this, misspecifying the support can degrade inference for, say, a central tendency measure of the response conditional on covariates. In some applications, inferring the support is the focal point of interest. For example, an accurate prediction of the support of the prevalence rate of COVID-19 in an upcoming flu season, one more refined than the unit interval, is important to local health officials. Other examples where the support of a response is unknown yet of practical interest include models for survival analysis to study the minimum possible lifetime (Smith, 1994), the job-search problem (Flinn and Heckman, 1982; Christensen and Kiefer, 1991), and the procurement-auction problem (Paarsch, 1992; Donald and Paarsch, 2002). In these and many other existing works on regression models with the support of the response depending on unknown parameters, the authors established some unusual, often unappealing, properties of maximum likelihood estimators for the support parameters and other model parameters (e.g., Donald and Paarsch, 1993; Smith, 1994). These theoretical findings motivated alternative estimators for parameters in these nonregular regression models, many of which were proposed in the Bayesian paradigm.

To allow for inference on the support along with other features of the response, we consider in this study the four-parameter beta distribution, which extends the beta distribution by introducing two parameters to define the support, in addition to the two shape parameters. As noted above, statisticians have long recognized that estimating the support creates a non-regular problem, where maximum likelihood estimation may fail to yield consistent estimators (Smith, 1985; Cheng and Traylor, 1995). Existing methods for estimating the four-parameter beta distribution include moment-based estimation (Johnson et al., 1995; McGarvey et al., 2002), maximum likelihood estimation when both shape parameters are greater than two (Carnahan, 1989), the corrected maximum likelihood method when both shape parameters are greater than one (Cheng and Iles, 1987), and the penalized likelihood approach (Wang, 2005), among others. The penalized likelihood approach of Wang (2005) is applicable without restricting the shape parameters to be above one or two, but standard errors for the estimators of the four parameters are not provided.

These existing works on four-parameter beta distributions are not in a regression context. In fact, we can find little research on the four-parameter beta distribution in a regression setting. In this article, we present a class of Bayesian regression models that permit inference for the support boundaries by considering the four-parameter beta distribution supported on (θ1,θ2), and introducing either a mean or mode parameter that depends linearly on covariates through a known link function. To facilitate Bayesian inference, we adopt an informative g-prior on the regression coefficients that leads to more efficient posterior sampling, especially when the data provide relatively weak information on the conditional mode or when multicollinearity is present. With a careful choice of blocking, we develop a fully automated (no manual “tuning” is required) Markov chain Monte Carlo (MCMC) algorithm for the posterior sampling. A new variation of the Cox-Snell residual plot (Cox and Snell, 1968) is provided for gross assessment of the model fit. Furthermore, all methods developed in the paper can be easily implemented in a freely available R package, betaBayes, which calls compiled C++. The ready availability of software allows researchers to empirically compare various competing beta regression models on their own data with a continuous bounded response.

The remainder of the article is organized as follows. Section 2 describes the four-parameter beta regression models, including prior development and posterior inference. We consider in Section 3 model selection criteria and model diagnostics. Section 4 presents simulations to illustrate the quality of inference results in comparison with relevant existing methods. Section 5 comprises two illustrative data analyses with software implementation. The paper is concluded in Section 6, where we summarize the contributions of our study and discuss future research directions.

Section snippets

The regression models

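Consider observed data consisting of $n$ independent realizations of the response-covariates pair, $D=\{(y_i,x_i),\ i=1,\ldots,n\}$, where $y_i$ is the response supported on an unknown interval $(\theta_1,\theta_2)$, and $x_i=(1,x_{i1},\ldots,x_{ip})$ is a vector of covariates with the intercept. For a random variable $Y$ that follows a four-parameter beta distribution, $Y\sim\mathrm{beta4}(\alpha_1,\alpha_2,\theta_1,\theta_2)$ in short, its probability density function (pdf) is given by

$$f_{\mathrm{beta}}(y;\alpha_1,\alpha_2,\theta_1,\theta_2)=\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,\frac{(y-\theta_1)^{\alpha_1-1}(\theta_2-y)^{\alpha_2-1}}{(\theta_2-\theta_1)^{\alpha_1+\alpha_2-1}},\quad \text{for } y\in(\theta_1,\theta_2),$$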
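The density above can be evaluated numerically as in the following sketch, which also computes the mean and mode of the four-parameter beta distribution from the standard formulas (the mode formula applies only when both shape parameters exceed one). The function names, the parameter values, and the midpoint-rule check are illustrative assumptions; none of this is the betaBayes implementation.

```python
import math

def beta4_pdf(y, a1, a2, t1, t2):
    """Density of the four-parameter beta distribution on (t1, t2)."""
    if not (t1 < y < t2):
        return 0.0
    # Work on the log scale for numerical stability.
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    log_kernel = ((a1 - 1.0) * math.log(y - t1)
                  + (a2 - 1.0) * math.log(t2 - y)
                  - (a1 + a2 - 1.0) * math.log(t2 - t1))
    return math.exp(log_norm + log_kernel)

def beta4_mean(a1, a2, t1, t2):
    """Mean: the unit-interval beta mean rescaled to (t1, t2)."""
    return t1 + (t2 - t1) * a1 / (a1 + a2)

def beta4_mode(a1, a2, t1, t2):
    """Mode; this formula requires a1 > 1 and a2 > 1."""
    if a1 <= 1.0 or a2 <= 1.0:
        raise ValueError("mode formula requires both shape parameters > 1")
    return t1 + (t2 - t1) * (a1 - 1.0) / (a1 + a2 - 2.0)

def midpoint_integral(f, lo, hi, n=20000):
    """Composite midpoint rule; evaluation points stay strictly inside (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

# Hypothetical parameter values, chosen only for illustration.
a1, a2, t1, t2 = 3.0, 5.0, 0.02, 0.6
total = midpoint_integral(lambda y: beta4_pdf(y, a1, a2, t1, t2), t1, t2)
m_mean = beta4_mean(a1, a2, t1, t2)
m_mode = beta4_mode(a1, a2, t1, t2)
print(total, m_mean, m_mode)
```

The integral check confirms that the density is properly normalized on $(\theta_1,\theta_2)$. Both the mean and the mode shift with the support boundaries, which is why misspecifying the support can distort inference on either quantity.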

Model comparison and diagnostics

To compare different regression models that one may fit to the same data set, we adopt three model criteria described next, all of which are readily computed from the MCMC output. This computational convenience partly motivates our choice among many existing and well-accepted model criteria that may be used here (Mills and Prasad, 1992; Claeskens, 2016). To set the notation, denote by $D_i$ the $i$th data point, and by $D_{-i}$ the data set with $D_i$ removed, for $i=1,\ldots,n$. Let $L_i(\cdot\mid\Omega)$ be the likelihood

Simulation studies

We design three simulation experiments to illustrate the implementation of the proposed regression methodology, to demonstrate the performance of the posterior inference, and to evaluate the effectiveness of model selection via DIC, WAIC, and LPML, and the graphical diagnostic method.
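To illustrate how criteria such as WAIC and LPML can be obtained as byproducts of MCMC output, the sketch below computes both from a matrix of pointwise log-likelihood values, one row per posterior draw and one column per observation. It uses the commonly cited definitions (LPML as the sum of log conditional predictive ordinates, each estimated by a harmonic mean over draws; WAIC with the variance-based penalty on the deviance scale). The exact formulas used in the article are not shown in this excerpt, so treat these as assumptions. DIC is omitted because it also needs the likelihood evaluated at a posterior point estimate. All names and toy numbers are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lpml(loglik):
    """LPML from loglik[s][i], the log-likelihood of observation i under draw s."""
    n_draws, n_obs = len(loglik), len(loglik[0])
    total = 0.0
    for i in range(n_obs):
        neg = [-loglik[s][i] for s in range(n_draws)]
        # log CPO_i = -log( mean over draws of 1 / L_i )
        total += -log_mean_exp(neg)
    return total

def waic(loglik):
    """WAIC on the deviance scale: -2 * (lppd - p_waic)."""
    n_draws, n_obs = len(loglik), len(loglik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        col = [loglik[s][i] for s in range(n_draws)]
        lppd += log_mean_exp(col)              # log pointwise predictive density
        mean = sum(col) / n_draws
        # Penalty: sample variance of the log-likelihood across draws.
        p_waic += sum((v - mean) ** 2 for v in col) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)

# Toy log-likelihood matrix: 3 posterior draws, 2 observations (made-up numbers).
toy = [[-1.0, -2.0], [-1.5, -2.5], [-0.5, -1.5]]
print(lpml(toy), waic(toy))
```

With these conventions a larger LPML and a smaller WAIC both indicate better predictive performance, so the two criteria are read in opposite directions when comparing candidate models.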

Real-life data applications

In this section, we apply the proposed Bayesian regression methodology to analyze two data sets from real-life applications. Sample R code for implementing the proposed models using the provided R package betaBayes is available in supplementary Appendix A.

Discussion

We propose a class of four-parameter beta regression models for studying the association between a continuous response bounded on an unknown interval and covariates via inferring either the conditional mean or mode of the response. Almost all existing approaches for analyzing bounded data assume a prefixed interval, which may not be accurate in many applications. To the best of our knowledge, the proposed regression models in this paper are the first regression framework allowing for an

Acknowledgements

The authors wish to thank the Co-Editor, anonymous Associate Editor, and two referees for their insightful comments and suggestions that greatly improved the manuscript.

References (77)

  • H. Zhang et al. Gaussian Bayesian network comparisons with graph ordering unknown. Comput. Stat. Data Anal. (2021)
  • H. Akaike. Information theory and an extension of the maximum likelihood principle
  • I. Baltazar-Aban et al. Properties of hazard-based residuals and implications in model diagnostics. J. Am. Stat. Assoc. (1995)
  • A.F. Barrientos et al. Fully nonparametric regression for bounded data using dependent Bernstein polynomials. J. Am. Stat. Assoc. (2017)
  • C.L. Bayes et al. A quantile parametric mixed regression model for bounded response variables. Stat. Interface (2017)
  • C.L. Bayes et al. A new robust regression model for proportions. Bayesian Anal. (2012)
  • A.J. Branscum et al. Bayesian beta regression: applications to household expenditure data and genetic distance between foot-and-mouth disease viruses. Aust. N. Z. J. Stat. (2007)
  • J. Carnahan. Maximum likelihood estimation for the 4-parameter beta distribution. Commun. Stat., Simul. Comput. (1989)
  • S. Chen et al. Fast Bayesian variable selection for high dimensional linear models: marginal solo spike and slab priors. Electron. J. Stat. (2019)
  • R. Cheng et al. Corrected maximum likelihood in non-regular problems. J. R. Stat. Soc. B (1987)
  • R. Cheng et al. Non-regular maximum likelihood problems. J. R. Stat. Soc. B (1995)
  • V. Chernozhukov et al. Likelihood estimation and inference in a class of nonregular econometric models. Econometrica (2004)
  • B.J. Christensen et al. The exact likelihood function for an empirical job search model. Econom. Theory (1991)
  • G. Claeskens. Statistical model choice. Annu. Rev. Stat. Appl. (2016)
  • P. Congdon. Bayesian Models for Categorical Data (2005)
  • D.R. Cox et al. A general definition of residuals. J. R. Stat. Soc. B (1968)
  • F. Cribari-Neto et al. Beta regression in R. J. Stat. Softw. (2010)
  • S.G. Donald et al. Piecewise pseudo-maximum likelihood estimation in empirical models of auctions. Int. Econ. Rev. (1993)
  • Dunn, P.K., Smyth, G.K., 2018. GLMsData: generalized linear model data sets. R package version...
  • I. Epifani et al. Case-deletion importance sampling estimators: central limit theorems and related results. Electron. J. Stat. (2008)
  • P.L. Espinheira et al. On beta regression residuals. J. Appl. Stat. (2008)
  • S. Ferrari et al. Beta regression for modelling rates and proportions. J. Appl. Stat. (2004)
  • S.L. Ferrari et al. Diagnostic tools in beta regression with varying dispersion. Stat. Neerl. (2011)
  • C.K. Fisher et al. Fast Bayesian feature selection for high dimensional linear regression in genomics via the Ising approximation
  • S. Geisser et al. A predictive approach to model selection. J. Am. Stat. Assoc. (1979)
  • A.E. Gelfand et al. Bayesian model choice: asymptotics and exact calculations. J. R. Stat. Soc. B (1994)
  • A. Gelman et al. Understanding predictive information criteria for Bayesian models. Stat. Comput. (2014)
  • B. Grün et al. Extended beta regression in R: shaken, stirred, mixed, and partitioned. J. Stat. Softw. (2012)