Bayesian beta regression for bounded responses with unknown supports
Introduction
Researchers in a wide range of fields encounter bounded data in their studies. For example, environmental scientists monitor the proportion of hygienic waste in residential solid waste. Asset allocations in a portfolio and the share of household income spent on food are bounded data of interest in economics. Psychologists analyze confidence ratings and bounded scores from cognitive tests administered to study subjects. Examples of bounded data in the biomedical field include prevalence rates and death rates of the coronavirus disease 2019 (COVID-19), and body fat percentages of athletes. Unlike for unbounded data, central tendency measures, skewness, and other features of the underlying distribution of bounded data are inextricable from the support of the distribution. Consequently, more caution is necessary when drawing inference for these features based on bounded data, especially when the support is unknown.
Existing approaches for analyzing bounded data typically assume a prefixed support such as (0, 1), sometimes after scaling the raw data. The beta mean regression model proposed by Ferrari and Cribari-Neto (2004) has probably received the most attention for modeling response data bounded on the unit interval, where the mean parameter of the beta distribution depends linearly on covariates through a known link function. Model diagnostic methods for a beta mean regression were considered in Espinheira et al. (2008a, 2008b), Ferrari et al. (2011), and Rocha and Simas (2011). The model has also been extended to allow the precision parameter to vary with covariates (Smithson and Verkuilen, 2006; Ferrari et al., 2011). An R package betareg (Cribari-Neto and Zeileis, 2010; Grün et al., 2012) is available on CRAN for fitting the beta mean regression model with varying precision and performing model diagnostics. This R package also allows for fitting a finite mixture of beta regression models (Verkuilen and Smithson, 2012). Time series analysis of bounded data via a beta mean regression is presented in Guolo et al. (2014), which incorporates a serial dependence between responses via a Gaussian copula. All the aforementioned works carry out frequentist inference, mostly based on maximum likelihood. Bayesian treatments for modeling response data bounded on (0, 1) include the Bayesian beta mean regression model (Branscum et al., 2007), a beta rectangular regression model based on a mixture of a beta distribution and a uniform distribution (Bayes et al., 2012), a mixed effects beta model (Figueroa-Zúñiga et al., 2013), and a flexible beta model based on a special mixture of two beta distributions (Migliorati et al., 2018). Unlike all the above regression models, which focus on inferring the conditional mean of a bounded response, Bayes et al. (2017) developed quantile regression models for bounded responses built upon beta distributions. Barrientos et al. (2017) proposed a fully nonparametric Bayesian approach to model the covariate-dependent distribution of a bounded response. Recently, Zhou et al. (2020) considered a beta mode regression model where the mode of the response is related to covariates through a link function.
All existing works mentioned above assume that the response variable is bounded on a prefixed interval such as (0, 1), which may not be appropriate. For example, a human being's body fat percentage can never reach a value close to zero or one. Google results show that the lowest body fat percentage in a human being is 2%; although the highest body fat percentage is not available, it is probably much less than one. In cases like this, misspecifying the support can degrade inference for, say, a central tendency measure of the response conditional on covariates. In some applications, inferring the support is the focal point of interest. For example, a prediction for the support of the prevalence rate of COVID-19 in an upcoming flu season that is more refined than the unit interval is important to local health officials. Other examples where the support of a response is unknown yet of practical interest include models for survival analysis to study the minimum possible lifetime (Smith, 1994), the job-search problem (Flinn and Heckman, 1982; Christensen and Kiefer, 1991), and the procurement-auction problem (Paarsch, 1992; Donald and Paarsch, 2002). In these and many other existing works on regression models with the support of the response depending on unknown parameters, the authors established some unusual, often unappealing, properties of maximum likelihood estimators for the support parameters and other model parameters (e.g., Donald and Paarsch, 1993; Smith, 1994). These theoretical findings motivated alternative estimators for parameters in these nonregular regression models, many of which were proposed in the Bayesian paradigm.
To allow for inference on the support along with other features of the response, we consider in this study the four-parameter beta distribution, which extends the beta distribution by introducing two parameters to define the support, in addition to the two shape parameters. As noted above, statisticians have long recognized that estimating the support creates a nonregular problem, where maximum likelihood estimation may fail to yield consistent estimators (Smith, 1985; Cheng and Traylor, 1995). Existing methods for estimating the four-parameter beta distribution include moment-based estimation (Johnson et al., 1995; McGarvey et al., 2002), maximum likelihood estimation when both shape parameters are greater than two (Carnahan, 1989), the corrected maximum likelihood method when both shape parameters are greater than one (Cheng and Iles, 1987), and the penalized likelihood approach (Wang, 2005), among others. The penalized likelihood approach by Wang (2005) is applicable without restricting the shape parameters to be above one or two, but standard error estimation for the estimators of the four parameters is not provided.
These existing works on four-parameter beta distributions are not in a regression context. In fact, we can find little research on the four-parameter beta distribution in a regression setting. In this article, we present a class of Bayesian regression models that permit inference for the support boundaries by considering the four-parameter beta distribution supported on an unknown interval, and introducing either a mean or mode parameter that linearly depends on covariates through a known link function. To facilitate Bayesian inference, we adopt an informative g-prior on the regression coefficients that leads to more efficient posterior sampling, especially when the data provide relatively weak information on the conditional mode or when multicollinearity is present. With a careful choice of blocking, we develop a fully automated (no manual “tuning” is required) Markov chain Monte Carlo (MCMC) algorithm for the posterior sampling. A new variation of the Cox-Snell residual plot (Cox and Snell, 1968) is provided for gross assessment of the model fit. Furthermore, all methods developed in the paper can be easily implemented in a freely available R package, betaBayes, which calls compiled C++. The ready availability of software allows researchers to empirically compare various competing beta regression models on their own data with a continuous bounded response.
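As an illustration only, the following Python sketch shows one way a mean or mode parameter on an unknown support could be tied to covariates through a link function. It assumes a logit link and a precision-type parameter phi in the spirit of the beta mean regression of Ferrari and Cribari-Neto (2004); the function names, the argument names, and the exact parameterization are hypothetical and are not taken from the paper or from the betaBayes package.

```python
import math


def inv_logit(eta):
    """Inverse logit link: maps a real linear predictor into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def shapes_from_mean(x, coef, phi, lower, upper):
    """Mean-regression sketch (assumed parameterization).

    Returns (mean on the original scale, shape a, shape b), where the
    standardized mean m = a / (a + b) is linked to the covariates.
    """
    eta = sum(xj * bj for xj, bj in zip(x, coef))  # linear predictor
    m = inv_logit(eta)                             # standardized mean in (0, 1)
    mean_y = lower + (upper - lower) * m           # mean on (lower, upper)
    return mean_y, m * phi, (1.0 - m) * phi


def shapes_from_mode(x, coef, phi, lower, upper):
    """Mode-regression sketch (assumed parameterization).

    Returns (mode on the original scale, shape a, shape b), where the
    standardized mode m = (a - 1) / (a + b - 2) is linked to the covariates.
    Both shapes exceed one, so the density is unimodal.
    """
    eta = sum(xj * bj for xj, bj in zip(x, coef))
    m = inv_logit(eta)
    mode_y = lower + (upper - lower) * m
    return mode_y, 1.0 + m * phi, 1.0 + (1.0 - m) * phi
```

Under these assumptions, the support boundaries only shift and rescale the standardized mean or mode, so the regression coefficients retain the same interpretation as in a beta regression on the unit interval.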
The remainder of the article is organized as follows. Section 2 describes the four-parameter beta regression models, including prior development and posterior inference. We consider in Section 3 model selection criteria and model diagnostics. Section 4 presents simulations to illustrate the quality of inference results when compared with relevant existing methods. Section 5 comprises two illustrative data analyses with software implementation. The paper is concluded in Section 6, where we summarize the contributions of our study and discuss future research directions.
The regression models
Consider observed data consisting of n independent realizations of the response-covariates pair, where the response is supported on an unknown interval and the covariate vector includes an intercept. For a random variable Y that follows a four-parameter beta distribution, its probability density function (pdf) is given by
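For readers who want to experiment, the standard density of a four-parameter beta distribution can be sketched in a few lines of Python. The notation here is an assumption rather than the paper's own: a and b denote the two shape parameters, and lower and upper denote the support boundaries. The density is computed on the log scale for numerical stability.

```python
import math


def beta4_logpdf(y, a, b, lower, upper):
    """Log-density of a four-parameter beta distribution on (lower, upper).

    a, b > 0 are shape parameters; lower < upper are the support boundaries.
    Returns -inf for y outside the open interval (lower, upper).
    """
    if a <= 0.0 or b <= 0.0 or lower >= upper:
        raise ValueError("need a > 0, b > 0, and lower < upper")
    if not (lower < y < upper):
        return float("-inf")
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return ((a - 1.0) * math.log(y - lower)
            + (b - 1.0) * math.log(upper - y)
            - log_beta_fn
            - (a + b - 1.0) * math.log(upper - lower))


def beta4_pdf(y, a, b, lower, upper):
    """Density on the natural scale; zero outside the support."""
    return math.exp(beta4_logpdf(y, a, b, lower, upper))
```

With lower = 0 and upper = 1 this reduces to the usual beta density, and with a = b = 1 it reduces to the uniform density on (lower, upper).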
Model comparison and diagnostics
To compare different regression models that one may fit to the same data set, we adopt three model criteria described next, all of which are readily computed from the MCMC output. This computational convenience partly motivates our choice among many existing and well-accepted model criteria that may be used here (Mills and Prasad, 1992; Claeskens, 2016). To set the notation, consider the ith data point and the data set with the ith data point removed, for i = 1, ..., n. Let be the likelihood
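To make concrete what "readily computed from the MCMC output" can look like, the Python sketch below computes two commonly used criteria, LPML (via conditional predictive ordinates) and WAIC, from a matrix of pointwise log-likelihood values. It uses the textbook definitions, which may differ in detail from those adopted in the paper; the input layout loglik[s][i] (posterior draw s, observation i) and the function names are assumptions made for this illustration. DIC is omitted because it additionally requires the likelihood evaluated at a posterior point estimate.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def lpml(loglik):
    """Log pseudo-marginal likelihood from pointwise log-likelihoods.

    loglik[s][i] is the log-likelihood of observation i at posterior draw s.
    CPO_i is estimated by the harmonic mean of the likelihood over draws,
    so log CPO_i = -log mean_s exp(-loglik[s][i]).
    """
    n_obs = len(loglik[0])
    total = 0.0
    for i in range(n_obs):
        neg_col = [-row[i] for row in loglik]
        total += -log_mean_exp(neg_col)
    return total


def waic(loglik):
    """WAIC on the deviance scale: -2 * (lppd - p_waic).

    p_waic sums the posterior sample variance of the pointwise
    log-likelihood; at least two draws are required.
    """
    n_draws = len(loglik)
    if n_draws < 2:
        raise ValueError("need at least two posterior draws")
    n_obs = len(loglik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        col = [row[i] for row in loglik]
        lppd += log_mean_exp(col)
        mean = sum(col) / n_draws
        p_waic += sum((v - mean) ** 2 for v in col) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)
```

A larger LPML and a smaller WAIC both indicate better predictive performance, so competing models fitted to the same data can be ranked directly from their saved log-likelihood draws.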
Simulation studies
We design three simulation experiments to illustrate the implementation of the proposed regression methodology, to demonstrate the performance of the posterior inference, and to evaluate the effectiveness of model selection via DIC, WAIC, and LPML, and the graphical diagnostic method.
Real-life data applications
In this section we apply the proposed Bayesian regression methodology to analyze two data sets from real-life applications. Sample R code for implementing the proposed models using the provided R package betaBayes is available in supplementary Appendix A.
Discussion
We propose a class of four-parameter beta regression models for studying the association between a continuous response bounded on an unknown interval and covariates via inferring either the conditional mean or mode of the response. Almost all existing approaches for analyzing bounded data assume a prefixed interval, which may not be accurate in many applications. To the best of our knowledge, the proposed regression models in this paper are the first regression framework allowing for an
Acknowledgements
The authors wish to thank the Co-Editor, anonymous Associate Editor, and two referees for their insightful comments and suggestions that greatly improved the manuscript.
References (77)
- et al. A Bayesian goodness-of-fit test for regression. Comput. Stat. Data Anal. (2021)
- Universal residuals: a multivariate transformation. Stat. Probab. Lett. (2007)
- et al. Superconsistent estimation and inference in structural econometric models using extreme order statistics. J. Econom. (2002)
- et al. Hybrid Monte Carlo. Phys. Lett. B (1987)
- et al. Influence diagnostics in beta regression. Comput. Stat. Data Anal. (2008)
- et al. Mixed beta regression: a Bayesian perspective. Comput. Stat. Data Anal. (2013)
- et al. New methods for analyzing structural models of labor force dynamics. J. Econom. (1982)
- Deciding between the common and private value paradigms in empirical models of auctions. J. Econom. (1992)
- et al. Unimodal density estimation using Bernstein polynomials. Comput. Stat. Data Anal. (2014)
- A note on estimation in the four-parameter beta distribution. Commun. Stat., Simul. Comput. (2005)