
Structured priors for sparse probability vectors with application to model selection in Markov chains

Published in Statistics and Computing

Abstract

We develop two prior distributions for probability vectors which, in contrast to the popular Dirichlet distribution, retain sparsity properties in the presence of data. Our models are appropriate for count data with many categories, most of which are expected to have negligible probability. Both models are tractable, allowing for efficient posterior sampling and marginalization. Consequently, they can replace the Dirichlet prior in hierarchical models without sacrificing convenient Gibbs sampling schemes. We derive both models and demonstrate their properties. We then illustrate their use for model-based selection with a hierarchical model in which we infer the active lag from time-series data. Using a squared-error loss, we demonstrate the utility of the models for data simulated from a nearly deterministic dynamical system. We also apply the prior models to an ecological time series of Chinook salmon abundance, demonstrating their ability to extract insights into the lag dependence.


References

  • Agresti, A., Hitchcock, D.B.: Bayesian inference for categorical data analysis. Stat. Methods Appl. 14(3), 297–330 (2005)

  • Albert, J.H., Gupta, A.K.: Mixtures of Dirichlet distributions and estimation in contingency tables. Ann. Stat. 10, 1261–1268 (1982)

  • Aitchison, J., Shen, S.M.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980)

  • Azat, J.: Grandtab.2016.04.11. California Central Valley Chinook Population Database Report. California Department of Fish and Wildlife. http://www.calfish.org/ProgramsData/Species/CDFWAnadromousResourceAssessment.aspx (2016). Accessed 9 Feb 2019

  • Berchtold, A., Raftery, A.E.: The mixture transition distribution model for high-order Markov chains and non-Gaussian time series. Stat. Sci. 17, 328–356 (2002)

  • Besag, J., Mondal, D.: Exact goodness-of-fit tests for Markov chains. Biometrics 69(2), 488–496 (2013)

  • Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)

  • Bouguila, N., Ziou, D.: Dirichlet-based probability model applied to human skin detection. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 5, pp. V-521 (2004)

  • Connor, R.J., Mosimann, J.E.: Concepts of independence for proportions with a generalization of the Dirichlet distribution. J. Am. Stat. Assoc. 64(325), 194–206 (1969)

  • Elfadaly, F.G., Garthwaite, P.H.: Eliciting Dirichlet and Gaussian copula prior distributions for multinomial models. Stat. Comput. 27(2), 449–467 (2017)

  • George, E.I., McCulloch, R.E.: Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88(423), 881–889 (1993)

  • George, E.I., McCulloch, R.E.: Approaches for Bayesian variable selection. Stat. Sin. 7, 339–373 (1997)

  • Good, I.J.: On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Ann. Stat. 4, 1159–1189 (1976)

  • Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995)

  • Hamilton, N.: ggtern: An Extension to ‘ggplot2’, for the Creation of Ternary Diagrams. R package version 2.2.1. https://CRAN.R-project.org/package=ggtern (2017). Accessed 9 Feb 2019

  • Hare, S.R., Mantua, N.J.: An historical narrative on the Pacific Decadal Oscillation, interdecadal climate variability and ecosystem impacts. Report of a talk presented at the 20th NE Pacific Pink and Chum Workshop, Seattle, WA (2001)

  • Hjort, N.L.: Bayesian approaches to non- and semiparametric density estimation. Preprint Series Statistical Research Report. http://urn.nb.no/URN:NBN:no-51774 (1994). Accessed 9 Feb 2019

  • Insua, D., Ruggeri, F., Wiper, M.: Bayesian Analysis of Stochastic Process Models, vol. 978. Wiley, New York (2012)

  • Lochner, R.H.: A generalized Dirichlet distribution in Bayesian life testing. J. R. Stat. Soc. Ser. B Methodol. 37, 103–113 (1975)

  • Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)

  • Prado, R., West, M.: Time Series: Modeling, Computation, and Inference. CRC Press, Boca Raton (2010)

  • Quinn, T.J., Deriso, R.B.: Quantitative Fish Dynamics. Oxford University Press, Oxford (1999)

  • Raftery, A.E.: A model for high-order Markov chains. J. R. Stat. Soc. Ser. B Methodol. 47, 528–539 (1985)

  • Raftery, A., Tavaré, S.: Estimation and modelling repeated patterns in high order Markov chains with the mixture transition distribution model. Appl. Stat. 43, 179–199 (1994)

  • Satterthwaite, W.H., Carlson, S.M., Criss, A.: Ocean size and corresponding life history diversity among the four run timings of California Central Valley Chinook salmon. Trans. Am. Fish. Soc. 146(4), 594–610 (2017)

  • Sethuraman, J.: A constructive definition of Dirichlet priors. Stat. Sin. 4, 639–650 (1994)

  • Tank, A., Fox, E.B., Shojaie, A.: Granger causality networks for categorical time series (2017). arXiv preprint arXiv:1706.02781

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)

  • Wong, T.T.: Generalized Dirichlet distribution in Bayesian analysis. Appl. Math. Comput. 97(2–3), 165–181 (1998)

  • Zucchini, W., MacDonald, I.L.: Hidden Markov Models for Time Series: An Introduction Using R, vol. 22. CRC Press, Boca Raton (2009)


Acknowledgements

The work of the first and the second author was supported in part by the National Science Foundation under award DMS 1310438. The authors gratefully acknowledge helpful comments from an editor and two anonymous referees.

Author information

Correspondence to Matthew Heiner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendices

A SBM correction for \(\delta _k\)

Here we derive the correction factor (7), proposed to account for sparsity when using the SBM prior as an extension of a Dirichlet prior. We define a probability \(\theta _k\) to be negligible if the corresponding \(Z_k\) is drawn from the first \(\text {Beta}(1,\eta )\) component, or if \(Z_j\) was drawn from the third \(\text {Beta}(\eta , 1)\) component for some \(j < k\). Let W be the total number of nonnegligible probabilities in \(\varvec{\theta }\). We write \(W = W_1 - W_2\), where \(W_1\) is the minimum k such that \(\xi _k=3\) (\(Z_k\) comes from the large component) or K, whichever is smaller, and \(W_2\) counts the number of times \(\xi _k=1\) (\(Z_k\) comes from the small component) among \(k = 1, \ldots , W_1 - 1\). \(W_1\) follows a truncated geometric distribution with success probability \(\pi _3\) and has expectation \(\text {E}(W_1) = (1 - (1 - \pi _3)^K)/\pi _3\). Conditional on \(W_1\), \(W_2\) is binomial with \(W_1 - 1\) trials and success probability \(\pi _1\), and thus has conditional expectation \(\text {E}(W_2 \mid W_1) = (W_1 - 1)\pi _1\). Putting these together, we find

$$\begin{aligned} \text {E}(W)&= \text {E}(W_1) - \text {E}_{W_1}[\text {E}(W_2 \mid W_1)] \\&= \text {E}(W_1) - \text {E}_{W_1}[(W_1 - 1)\pi _1] \\&= \text {E}(W_1) - \pi _1 \text {E}(W_1) + \pi _1. \end{aligned}$$

Substituting \(\text {E}(W_1)\) and dividing by K yields (7). In the \(\pi _3 = 0\) case, we have \(W_1 = K\) with probability 1, so that \(\text {E}(W_1) = K\).

B Marginal distributions

We report the marginal distributions of observations associated with the Dirichlet, SDM and SBM models for probability vectors. These distributions can be useful for computing Bayes factors in addition to facilitating the MCMC algorithm described in “Appendix C.2.”

Consider a sequence of independent random variables \(\{s_t\} \in \{1, \ldots , K\}^T\) with common distribution \(\varvec{\theta } \equiv (\theta _1, \ldots , \theta _K)\). Given \(\varvec{\theta }\), the probability of the sequence is \(\prod _{t} \theta _{s_t} = \theta _1^{n_1} \cdots \theta _K^{n_K}\) where the sufficient statistics in \(\varvec{n} = (n_1, \ldots , n_K)\) count the occurrences of each category. If the ordering t is not important, the probability is multiplied by the multinomial coefficient \(T! / (n_1! \cdots n_K!)\).

If \(\varvec{\theta }\) follows a Dirichlet distribution with shape parameter vector \(\varvec{\alpha }\), then the marginal (prior predictive) distribution of \(\{s_t\}\) is given by

$$\begin{aligned} p( \{s_t\} )&= \int p( \{s_t\} \mid \varvec{\theta })\, p(\varvec{\theta }) \, \mathrm{d}\varvec{\theta } \nonumber \\&= \frac{ \varGamma ( \sum _k \alpha _k ) }{ \prod _k \varGamma (\alpha _k) } \int \theta _1^{\alpha _1 + n_1 - 1} \cdots \theta _K^{\alpha _K + n_K - 1} \, \mathrm{d}\varvec{\theta } \nonumber \\&= \frac{ \varGamma ( \sum _k \alpha _k ) }{ \prod _k \varGamma (\alpha _k) } \frac{ \prod _k \varGamma (\alpha _k + n_k) }{ \varGamma ( \sum _k \alpha _k + n_k ) } = \frac{ \text {MVB}( \varvec{\alpha } + \varvec{n}) }{ \text {MVB}( \varvec{\alpha }) }, \end{aligned}$$
(12)

where \(\text {MVB}(\cdot )\) denotes the multivariate beta function and the integral is defined on the simplex supporting \(\varvec{\theta }\).
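In practice, (12) is best evaluated on the log scale via log-gamma functions to avoid overflow in the gamma ratios. A minimal sketch (the function names are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_mvbeta(a):
    # log multivariate beta: sum_k log Gamma(a_k) - log Gamma(sum_k a_k)
    return gammaln(a).sum() - gammaln(a.sum())

def log_marginal_dirichlet(n, alpha):
    # log p({s_t}) from (12): MVB(alpha + n) / MVB(alpha), in log space
    return log_mvbeta(alpha + n) - log_mvbeta(alpha)
```

For example, with K = 2 and a uniform prior \(\varvec{\alpha } = (1, 1)\), the marginal reduces to \(n_1!\, n_2! / (T + 1)!\).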

If \(\varvec{\theta }\) follows the SDM distribution with parameters \(\varvec{\alpha }\) and \(\beta \), and \(w_k = \prod _{j=1}^K \varGamma (\alpha _j + \beta 1_{(j=k)})\), then the marginal distribution of \(\{s_t\}\) is given by

$$\begin{aligned} p( \{s_t\} )&= \int \theta _1^{n_1} \cdots \theta _K^{n_K} \sum _{k=1}^K \frac{w_k}{\sum _{j=1}^K w_j} \text {Dir}(\varvec{\theta }; \varvec{\alpha } + \beta \varvec{e}_k) \, \mathrm{d}\varvec{\theta } \nonumber \\&= \sum _{k=1}^K \frac{w_k}{\sum _{j=1}^K w_j} \frac{ \text {MVB}( \varvec{\alpha } + \beta \varvec{e}_k + \varvec{n}) }{ \text {MVB}( \varvec{\alpha } + \beta \varvec{e}_k ) }, \end{aligned}$$
(13)

where \(\varvec{e}_k\) denotes a vector of 0s with a 1 in the kth entry.
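Equation (13) is a finite mixture over which component receives the extra \(\beta \) mass, so it can be evaluated stably with a log-sum-exp over the K terms. A sketch under our own naming (note the common factors of \(\prod_j \varGamma(\alpha_j)\) in the weights cancel upon normalization):

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_marginal_sdm(n, alpha, beta):
    # log of (13): mixture over the component k receiving the extra beta mass
    K = len(n)
    shifted = alpha + beta * np.eye(K)       # row k is alpha + beta * e_k
    log_w = gammaln(shifted).sum(axis=1)     # log w_k (up to a common constant)
    log_w -= logsumexp(log_w)                # normalized mixture log-weights
    log_mvb = lambda a: gammaln(a).sum() - gammaln(a.sum())
    terms = [log_w[k] + log_mvb(shifted[k] + n) - log_mvb(shifted[k])
             for k in range(K)]
    return logsumexp(np.array(terms))
```

A convenient check: with \(\beta = 0\) every mixture component is the same Dirichlet, so (13) collapses to (12).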

Before considering the SBM model, we first obtain the marginal distribution under the generalized Dirichlet (Connor and Mosimann 1969). Working directly with the stick-breaking weights, we have \(p(\varvec{Z}) = \prod _{k=1}^{K-1} \text {Beta}( Z_k ; a_k, b_k )\) and \(p( \{s_t\} \mid \varvec{Z}) = Z_1^{n_1} (1-Z_1)^{\sum _{k=2}^K n_k} \cdots Z_{K-1}^{n_{K-1}} (1-Z_{K-1})^{n_K} \). Putting these together and integrating over \(\varvec{Z}\) results in \(p(\{s_t\}) = \prod _{k=1}^{K-1} g_k(a_k, b_k, \varvec{n}) \) where

$$\begin{aligned} g_k(a_k, b_k, \varvec{n}) = \frac{\varGamma (a_k + b_k)}{\varGamma (a_k^* + b_k^*)} \frac{\varGamma (a_k^*)}{\varGamma (a_k)} \frac{\varGamma (b_k^*)}{\varGamma (b_k)}, \end{aligned}$$

with \(a_k^* = a_k + n_k\) and \(b_k^* = b_k + \sum _{j=k+1}^K n_j \, \). Using a similar approach, it can be shown that under the SBM model with parameters \(\varvec{\pi }\), \(\eta \), \(\varvec{\gamma }\) and \(\varvec{\delta }\), we have

$$\begin{aligned}&p(\{s_t\}) = \prod _{k=1}^{K-1} [\pi _1 g_k(1, \eta , \varvec{n})+\pi _2 g_k(\gamma _k, \delta _k, \varvec{n}) \nonumber \\&\quad + \pi _3 g_k(\eta , 1, \varvec{n})]. \end{aligned}$$
(14)
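Because (14) factors over the stick-breaking indices, each factor is a three-term mixture of \(g_k\) values and can be accumulated in log space. The following sketch uses our own function names, with k zero-based:

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_g(a, b, k, n):
    # log g_k from Appendix B, with a* = a + n_k and b* = b + sum_{j>k} n_j
    a_star, b_star = a + n[k], b + n[k + 1:].sum()
    return (gammaln(a + b) - gammaln(a_star + b_star)
            + gammaln(a_star) - gammaln(a)
            + gammaln(b_star) - gammaln(b))

def log_marginal_sbm(n, pi, eta, gamma, delta):
    # log of (14): product over k of a three-component mixture of g_k terms
    with np.errstate(divide="ignore"):   # allow log(0) for zero mixture weights
        log_pi = np.log(pi)
    out = 0.0
    for k in range(len(n) - 1):
        out += logsumexp(np.array([
            log_pi[0] + log_g(1.0, eta, k, n),
            log_pi[1] + log_g(gamma[k], delta[k], k, n),
            log_pi[2] + log_g(eta, 1.0, k, n)]))
    return out
```

As a numerical check, setting \(\varvec{\pi } = (0, 1, 0)\) with \(\gamma _k = \alpha _k\) and \(\delta _k = \sum _{j>k} \alpha _j\) recovers the ordinary Dirichlet marginal (12).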

C MCMC algorithm details

Following the hierarchical MTD model outlined in (10), the joint posterior distribution of all unknown parameters is given up to proportionality:

$$\begin{aligned} p(\{\zeta _t\}, \varvec{\lambda }, \varvec{Q} \mid \{s_t\})&\propto p(\varvec{\lambda }) \, \prod _k[ p(\varvec{Q}_{k,\cdot }) ] \nonumber \\&\times \prod _t [ p(\zeta _t \mid \varvec{\lambda }) \, p(s_t \mid \zeta _t, \varvec{Q}) ]. \end{aligned}$$
(15)

1.1 C.1 Original algorithm

MCMC sampling for the original hierarchical MTD structure (Insua et al. 2012) can be achieved entirely with Gibbs updates. This is also the case when substituting in the SDM and SBM priors. A Gibbs sampler cycles through the parameters, drawing updates from the conditional distributions given below.

  • \(\Pr ( \zeta _t = \ell \mid \cdots ) \propto \Pr ( \zeta _t = \ell \mid \varvec{\lambda } ) \, p( s_t \mid \zeta _t = \ell , \varvec{Q} ) = \lambda _\ell \, (\varvec{Q})_{s_{t-\ell }, s_t}\), independently for each \(t \in \{R+1, \ldots , T\}\).

  • \(p(\varvec{\lambda } \mid \cdots ) \propto p(\varvec{\lambda })\, \prod _t p(\zeta _t \mid \varvec{\lambda }) = \text {Dir}( \varvec{\lambda }; \varvec{\alpha }_\lambda ) \, \prod _t \lambda _{\zeta _t} \), a standard Dirichlet/multinomial update using the counts of \(\zeta _t\) in each of \(\{1, \ldots , R\}\). In the case of an SDM prior for \(\varvec{\lambda }\), this becomes the standard SDM update given in Sect. 2.1. In the case of an SBM prior for \(\varvec{\lambda }\), this becomes the standard SBM update given in Sect. 2.2, which draws updated latent stick-breaking weights and constructs \(\varvec{\lambda }\).

  • \(p( \varvec{Q}_{k,\cdot } \mid \cdots ) \propto p(\varvec{Q}_{k,\cdot }) \prod _{t : s_{t-\zeta _t} = k} p(s_t \mid \zeta _t, \varvec{Q}_{k,\cdot }) = \text {Dir}( \varvec{Q}_{k, \cdot }; \varvec{\alpha }_Q ) \, \prod _{t : s_{t-\zeta _t} = k} (\varvec{Q})_{k, s_t} \), which is again a standard Dirichlet/multinomial update using transition counts. In the case of an SBM prior for the kth row of \(\varvec{Q}\), this becomes the standard SBM update given in Sect. 2.2, which draws updated latent stick-breaking weights and constructs the row of \(\varvec{Q}\).

This Gibbs sampler can be modified to include an update for \(\varvec{\pi }_k\) together with the update for the kth row of \(\varvec{Q}\) since the only component of the posterior in (15) dependent on \(\varvec{\pi }_k\) is \(p(\varvec{Q}_{k,\cdot })\). A Dirichlet prior for \(\varvec{\pi }_k\) results in a Dirichlet full conditional, using the counts of latent \(\xi \) variables drawn using (6). Updates for any other hyperparameters would require an alternate sampling scheme.
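The full cycle of conditional updates above can be sketched compactly. The code below is an illustrative implementation with plain Dirichlet priors on \(\varvec{\lambda }\) and the rows of \(\varvec{Q}\) (the SDM and SBM variants would replace the two Dirichlet draws); the function name and interface are ours, not from the paper, and states are coded 0, ..., K-1.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_mtd(s, R, K, alpha_lam, alpha_Q, iters=1000):
    # Gibbs sampler sketch for the hierarchical MTD model (10) with
    # Dirichlet priors on lambda and each row of Q (hypothetical code).
    T = len(s)
    lam = np.full(R, 1.0 / R)
    Q = np.full((K, K), 1.0 / K)
    zeta = rng.integers(1, R + 1, size=T - R)   # lag in {1, ..., R} per t > R
    for _ in range(iters):
        # update each zeta_t from its discrete full conditional
        for i, t in enumerate(range(R, T)):
            probs = np.array([lam[l] * Q[s[t - (l + 1)], s[t]]
                              for l in range(R)])
            zeta[i] = rng.choice(R, p=probs / probs.sum()) + 1
        # Dirichlet/multinomial update for lambda from the lag counts
        lag_counts = np.bincount(zeta - 1, minlength=R)
        lam = rng.dirichlet(alpha_lam + lag_counts)
        # Dirichlet/multinomial update for each row of Q from transition counts
        N = np.zeros((K, K))
        for i, t in enumerate(range(R, T)):
            N[s[t - zeta[i]], s[t]] += 1
        Q = np.vstack([rng.dirichlet(alpha_Q + N[k]) for k in range(K)])
    return lam, Q
```

Each draw is conjugate, so the cycle requires no tuning; this is the convenience the SDM and SBM priors are designed to preserve.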

1.2 C.2 Modified algorithm

Sampling \(\{\zeta _t \}\) conditional on \(\varvec{Q}\), followed by \(\varvec{Q}\) conditional on \(\{\zeta _t\}\), results in a chain that tends to get stuck. To improve mixing, we instead integrate \(\varvec{Q}\) out of the joint posterior (15) and conduct Gibbs sampling between \(\{\zeta _t \}\) and \(\varvec{\lambda }\). At each iteration, it is then straightforward to draw \(\varvec{Q}\) from the conditional distribution given in C.1.

Let \(\varvec{N}\) be a \(K\times K\) matrix of transition counts for which the \((k_1, k_2)\) entry is the cardinality of \(\{ t : s_{t - \zeta _t} = k_1 \ \text {and} \ s_t = k_2 \}\). Integrating \(\varvec{Q}\) from (15) yields \( p(\{\zeta _t\}, \varvec{\lambda } \mid \{s_t\}) \propto p(\varvec{\lambda }) \, \prod _t [ p(\zeta _t \mid \varvec{\lambda }) \, p(s_t \mid \zeta _t) ] \) which differs from the original only in that

$$\begin{aligned} \prod _{t=R+1}^T [ p(s_t \mid \zeta _t) ] = \prod _{k=1}^K h( \varvec{N}_{k,\cdot }, \phi _k ), \end{aligned}$$
(16)

where \(h(\cdot , \phi )\) takes the form of (12) if the rows of \(\varvec{Q}\) are independent Dirichlet, (13) if they are independent SDM, and (14) if they are independent SBM; here \(\phi \) refers to generic hyperparameters appropriate for the choice of prior.

The modified algorithm then proceeds with the standard update for \(\varvec{\lambda }\) given in C.1. Each \(\zeta _t \) is then updated individually using its full conditional involving \(\varvec{\lambda }\) and (16) by modifying \(\varvec{N}\) to reflect each possible value of \(\zeta _t \in \{1, \ldots , R\}\). Unnecessary computation can be avoided by noting redundancies in the denominators of (12) and (13), and computing (16) only over values of k such that \(s_{t-\ell } = k\) for \(\ell = 1, \ldots , R\).
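To make the bookkeeping concrete, here is a sketch of the collapsed update for a single \(\zeta _t\) in the case where the rows of \(\varvec{Q}\) have independent Dirichlet priors, so that h in (16) is the ratio form (12). The function names are ours; \(\varvec{N}\) is assumed to hold the transition counts with time t's transition removed, and this sketch recomputes all rows of (16) rather than exploiting the redundancies noted above.

```python
import numpy as np
from scipy.special import gammaln

def log_h_dirichlet(N_row, alpha):
    # log of (12) applied to one row of transition counts
    a = alpha + N_row
    return (gammaln(a).sum() - gammaln(a.sum())
            - gammaln(alpha).sum() + gammaln(alpha.sum()))

def zeta_full_conditional(t, s, N, lam, alpha, R):
    # Unnormalized log full conditional of zeta_t under the collapsed sampler:
    # for each candidate lag l, add time t's transition to N and evaluate
    # log lambda_l plus the log of (16).
    K = N.shape[0]
    logp = np.empty(R)
    for l in range(1, R + 1):
        N_l = N.copy()
        N_l[s[t - l], s[t]] += 1
        logp[l - 1] = np.log(lam[l - 1]) + sum(
            log_h_dirichlet(N_l[k], alpha) for k in range(K))
    p = np.exp(logp - logp.max())   # normalize stably
    return p / p.sum()
```

With symmetric counts and a uniform \(\varvec{\lambda }\), the two candidate lags are exchangeable and the conditional is uniform, which provides a simple sanity check.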

Allowing inference for \(\{\varvec{\pi }_k\}\) when utilizing SBM priors for \(\varvec{Q}\) is no more complicated than in the original algorithm: simply draw \(\varvec{\pi }_k\) when sampling from the conditional for the kth row of \(\varvec{Q}\). Note, however, that integrating over \(\varvec{Q}\) leaves the update for \(\varvec{\lambda }\) conditional on \(\{\varvec{\pi }_k\}\). Nevertheless, the result is still a valid Gibbs sampler that produces acceptable mixing in practice.


About this article


Cite this article

Heiner, M., Kottas, A. & Munch, S. Structured priors for sparse probability vectors with application to model selection in Markov chains. Stat Comput 29, 1077–1093 (2019). https://doi.org/10.1007/s11222-019-09856-2
