Abstract
We develop two prior distributions for probability vectors which, in contrast to the popular Dirichlet distribution, retain sparsity properties in the presence of data. Our models are appropriate for count data with many categories, most of which are expected to have negligible probability. Both models are tractable, allowing for efficient posterior sampling and marginalization. Consequently, they can replace the Dirichlet prior in hierarchical models without sacrificing convenient Gibbs sampling schemes. We derive both models and demonstrate their properties. We then illustrate their use for model-based selection with a hierarchical model in which we infer the active lag from time-series data. Using a squared-error loss, we demonstrate the utility of the models for data simulated from a nearly deterministic dynamical system. We also apply the prior models to an ecological time series of Chinook salmon abundance, demonstrating their ability to extract insights into the lag dependence.
References
Agresti, A., Hitchcock, D.B.: Bayesian inference for categorical data analysis. Stat. Methods Appl. 14(3), 297–330 (2005)
Albert, J.H., Gupta, A.K.: Mixtures of Dirichlet distributions and estimation in contingency tables. Ann. Stat. 10, 1261–1268 (1982)
Aitchison, J., Shen, S.M.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980)
Azat, J.: Grandtab.2016.04.11. California Central Valley Chinook Population Database Report. California Department of Fish and Wildlife. http://www.calfish.org/ProgramsData/Species/CDFWAnadromousResourceAssessment.aspx (2016). Accessed 9 Feb 2019
Berchtold, A., Raftery, A.E.: The mixture transition distribution model for high-order Markov chains and non-Gaussian time series. Stat. Sci. 17, 328–356 (2002)
Besag, J., Mondal, D.: Exact goodness-of-fit tests for Markov chains. Biometrics 69(2), 488–496 (2013)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Bouguila, N., Ziou, D.: Dirichlet-based probability model applied to human skin detection. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’04), vol. 5, pp. V-521. IEEE (2004)
Connor, R.J., Mosimann, J.E.: Concepts of independence for proportions with a generalization of the Dirichlet distribution. J. Am. Stat. Assoc. 64(325), 194–206 (1969)
Elfadaly, F.G., Garthwaite, P.H.: Eliciting Dirichlet and Gaussian copula prior distributions for multinomial models. Stat. Comput. 27(2), 449–467 (2017)
George, E.I., McCulloch, R.E.: Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88(423), 881–889 (1993)
George, E.I., McCulloch, R.E.: Approaches for Bayesian variable selection. Stat. Sin. 7, 339–373 (1997)
Good, I.J.: On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Ann. Stat. 4, 1159–1189 (1976)
Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995)
Hamilton, N.: ggtern: An Extension to ‘ggplot2’, for the Creation of Ternary Diagrams. R package version 2.2.1. https://CRAN.R-project.org/package=ggtern (2017). Accessed 9 Feb 2019
Hare, S.R., Mantua, N.J.: An historical narrative on the Pacific Decadal Oscillation, interdecadal climate variability and ecosystem impacts. Report of a Talk Presented at the 20th NE Pacific Pink and Chum Workshop, Seattle, WA (2001)
Hjort, N.L.: Bayesian approaches to non- and semiparametric density estimation. Preprint Series Statistical Research Report. http://urn.nb.no/URN:NBN:no-51774 (1994). Accessed 9 Feb 2019
Insua, D., Ruggeri, F., Wiper, M.: Bayesian Analysis of Stochastic Process Models. Wiley, New York (2012)
Lochner, R.H.: A generalized Dirichlet distribution in Bayesian life testing. J. R. Stat. Soc. Ser. B Methodol. 37, 103–113 (1975)
Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)
Prado, R., West, M.: Time Series: Modeling, Computation, and Inference. CRC Press, Boca Raton (2010)
Quinn, T.J., Deriso, R.B.: Quantitative Fish Dynamics. Oxford University Press, Oxford (1999)
Raftery, A.E.: A model for high-order Markov chains. J. R. Stat. Soc. Ser. B Methodol. 47, 528–539 (1985)
Raftery, A., Tavaré, S.: Estimation and modelling repeated patterns in high order Markov chains with the mixture transition distribution model. Appl. Stat. 43, 179–199 (1994)
Satterthwaite, W.H., Carlson, S.M., Criss, A.: Ocean size and corresponding life history diversity among the four run timings of California Central Valley Chinook salmon. Trans. Am. Fish. Soc. 146(4), 594–610 (2017)
Sethuraman, J.: A constructive definition of Dirichlet priors. Stat. Sin. 4, 639–650 (1994)
Tank, A., Fox, E.B., Shojaie, A.: Granger causality networks for categorical time series. arXiv preprint arXiv:1706.02781 (2017)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
Wong, T.T.: Generalized Dirichlet distribution in Bayesian analysis. Appl. Math. Comput. 97(2–3), 165–181 (1998)
Zucchini, W., MacDonald, I.L.: Hidden Markov Models for Time Series: An Introduction Using R, vol. 22. CRC Press, Boca Raton (2009)
Acknowledgements
The work of the first and the second author was supported in part by the National Science Foundation under award DMS 1310438. The authors gratefully acknowledge helpful comments from an editor and two anonymous referees.
Appendices
A SBM correction for \(\delta _k\)
Here we derive the correction factor (7), proposed to account for sparsity when using the SBM prior as an extension of a Dirichlet prior. We define a probability \(\theta _k\) to be negligible if the corresponding \(Z_k\) is drawn from the first \(\text {Beta}(1,\eta )\) component, or if \(Z_j\) was drawn from the third \(\text {Beta}(\eta , 1)\) component for some \(j < k\). Let W be the total number of nonnegligible probabilities in \(\varvec{\theta }\). We can write \(W = W_1 - W_2\), where \(W_1\) is the minimum k such that \(\xi _k=3\) (\(Z_k\) comes from the large component) or K, whichever is smaller, and \(W_2\) counts the number of times \(\xi _k=1\) (\(Z_k\) comes from the small component) among \(k = 1, \ldots , W_1 - 1\). \(W_1\) follows a truncated geometric distribution with success probability \(\pi _3\) and has expectation \(\text {E}(W_1) = (1 - (1 - \pi _3)^K)/\pi _3\). Conditional on \(W_1\), \(W_2\) is binomial with \(W_1 - 1\) trials and success probability \(\pi _1\), and thus has conditional expectation \(\text {E}(W_2 \mid W_1) = (W_1 - 1)\pi _1\). Putting these together, we find

$$\text {E}(W) = \text {E}(W_1) - \text {E}\left[ \text {E}(W_2 \mid W_1) \right] = (1 - \pi _1)\, \text {E}(W_1) + \pi _1 .$$
Substituting \(\text {E}(W_1)\) and dividing by K yields (7). In the \(\pi _3 = 0\) case, we have \(W_1 = K\) with probability 1, so that \(\text {E}(W_1) = K\).
B Marginal distributions
We report the marginal distributions of observations associated with the Dirichlet, SDM and SBM models for probability vectors. These distributions can be useful for computing Bayes factors in addition to facilitating the MCMC algorithm described in “Appendix C.2.”
Consider a sequence of independent random variables \(\{s_t\} \in \{1, \ldots , K\}^T\) with common distribution \(\varvec{\theta } \equiv (\theta _1, \ldots , \theta _K)\). Given \(\varvec{\theta }\), the probability of the sequence is \(\prod _{t} \theta _{s_t} = \theta _1^{n_1} \cdots \theta _K^{n_K}\) where the sufficient statistics in \(\varvec{n} = (n_1, \ldots , n_K)\) count the occurrences of each category. If the ordering t is not important, the probability is multiplied by the multinomial coefficient \(T! / (n_1! \cdots n_K!)\).
If \(\varvec{\theta }\) follows a Dirichlet distribution with shape parameter vector \(\varvec{\alpha }\), then the marginal (prior predictive) distribution of \(\{s_t\}\) is given by

$$p(\{s_t\}) = \int \theta _1^{n_1} \cdots \theta _K^{n_K} \, \text {Dir}(\varvec{\theta }; \varvec{\alpha }) \, \text {d}\varvec{\theta } = \frac{\text {MVB}(\varvec{\alpha } + \varvec{n})}{\text {MVB}(\varvec{\alpha })} ,$$

where \(\text {MVB}(\cdot )\) denotes the multivariate beta function and the integral is defined on the simplex supporting \(\varvec{\theta }\).
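In log space this marginal is just two evaluations of the multivariate beta function. A minimal sketch (function names are ours, not from the paper):

```python
from math import lgamma, exp

def log_mvb(a):
    # multivariate beta: prod_j Gamma(a_j) / Gamma(sum_j a_j), in log space
    return sum(lgamma(x) for x in a) - lgamma(sum(a))

def log_marginal_dirichlet(n, alpha):
    # log p({s_t}) = log MVB(alpha + n) - log MVB(alpha), for one particular
    # ordering of the sequence (multiply by the multinomial coefficient
    # if the ordering t is not important)
    return log_mvb([a + c for a, c in zip(alpha, n)]) - log_mvb(alpha)

# uniform Dirichlet(1, 1): a specific binary sequence of length 3 with
# counts (2, 1) has marginal probability 2! * 1! / 4! = 1/12
p = exp(log_marginal_dirichlet([2, 1], [1.0, 1.0]))
```

Working in log space avoids overflow in the gamma functions when the counts \(n_k\) are large.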
If \(\varvec{\theta }\) follows the SDM distribution with parameters \(\varvec{\alpha }\) and \(\beta \), and \(w_k = \prod _{j=1}^K \varGamma (\alpha _j + \beta 1_{(j=k)})\), then the marginal distribution of \(\{s_t\}\) is given by
where \(\varvec{e}_k\) denotes a vector of 0s with a 1 in the kth entry.
Before considering the SBM model, we first obtain the marginal distribution under the generalized Dirichlet (Connor and Mosimann 1969). Working directly with the stick-breaking weights, we have \(p(\varvec{Z}) = \prod _{k=1}^{K-1} \text {Beta}( Z_k ; a_k, b_k )\) and \(p( \{s_t\} \mid \varvec{Z}) = Z_1^{n_1} (1-Z_1)^{\sum _{k=2}^K n_k} \cdots Z_{K-1}^{n_{K-1}} (1-Z_{K-1})^{n_K} \). Putting these together and integrating over \(\varvec{Z}\) results in \(p(\{s_t\}) = \prod _{k=1}^{K-1} g_k(a_k, b_k, \varvec{n}) \), where

$$g_k(a_k, b_k, \varvec{n}) = \frac{\text {B}(a_k^*, b_k^*)}{\text {B}(a_k, b_k)} ,$$

with \(\text {B}(\cdot , \cdot )\) denoting the beta function, \(a_k^* = a_k + n_k\) and \(b_k^* = b_k + \sum _{j=k+1}^K n_j \). Using a similar approach, it can be shown that the SBM model with parameters \(\varvec{\pi }\), \(\eta \), \(\varvec{\gamma }\) and \(\varvec{\delta }\) also admits a closed-form marginal for \(\{s_t\}\).
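A quick numerical check of the generalized Dirichlet product form, using our own naming (each factor \(g_k\) is the beta-binomial ratio \(\text {B}(a_k + n_k, b_k + \sum _{j>k} n_j) / \text {B}(a_k, b_k)\) obtained from integrating one stick-breaking weight): with \(a_k = 1\) and \(b_k = K - k\) the generalized Dirichlet coincides with a uniform Dirichlet, so the product must match the Dirichlet marginal.

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_gd(n, a, b):
    # product over k = 1..K-1 of g_k = B(a_k*, b_k*) / B(a_k, b_k),
    # with a_k* = a_k + n_k and b_k* = b_k + sum_{j > k} n_j
    tail = sum(n)
    total = 0.0
    for k in range(len(n) - 1):
        tail -= n[k]                   # now sum_{j > k} n_j
        total += log_beta(a[k] + n[k], b[k] + tail) - log_beta(a[k], b[k])
    return total

# a_k = 1, b_k = K - k reduces to a uniform Dirichlet on the 3-simplex;
# counts (2, 1, 1) then give marginal MVB((3, 2, 2)) / MVB((1, 1, 1)) = 1/180
p = exp(log_marginal_gd([2, 1, 1], [1.0, 1.0], [2.0, 1.0]))
```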
C MCMC algorithm details
Following the hierarchical MTD model outlined in (10), the joint posterior distribution of all unknown parameters is given, up to proportionality, by

$$p(\varvec{\lambda }, \varvec{Q}, \{\zeta _t\} \mid \{s_t\}) \propto p(\varvec{\lambda }) \left[ \prod _{k=1}^{K} p(\varvec{Q}_{k,\cdot }) \right] \prod _{t=R+1}^{T} \lambda _{\zeta _t} \, (\varvec{Q})_{s_{t-\zeta _t}, s_t} . \qquad (15)$$
C.1 Original algorithm
MCMC sampling for the original hierarchical MTD structure (Insua et al. 2012) can be achieved entirely with Gibbs updates. This is also the case when substituting in the SDM and SBM priors. A Gibbs sampler cycles through the parameters, drawing updates from the conditional distributions given below.
- \(\Pr ( \zeta _t = \ell \mid \cdots ) \propto \Pr ( \zeta _t = \ell \mid \varvec{\lambda } ) \, p( s_t \mid \zeta _t = \ell , \varvec{Q} ) = \lambda _\ell \, (\varvec{Q})_{s_{t-\ell }, s_t}\), independently for each \(t \in \{R+1, \ldots , T\}\).
- \(p(\varvec{\lambda } \mid \cdots ) \propto p(\varvec{\lambda })\, \prod _t p(\zeta _t \mid \varvec{\lambda }) = \text {Dir}( \varvec{\lambda }; \varvec{\alpha }_\lambda ) \, \prod _t \lambda _{\zeta _t} \), a standard Dirichlet/multinomial update using the counts of \(\zeta _t\) in each of \(\{1, \ldots , R\}\). In the case of an SDM prior for \(\varvec{\lambda }\), this becomes the standard SDM update given in Sect. 2.1. In the case of an SBM prior for \(\varvec{\lambda }\), this becomes the standard SBM update given in Sect. 2.2, which draws updated latent stick-breaking weights and constructs \(\varvec{\lambda }\).
- \(p( \varvec{Q}_{k,\cdot } \mid \cdots ) \propto p(\varvec{Q}_{k,\cdot }) \prod _{t : s_{t-\zeta _t} = k} p(s_t \mid \zeta _t, \varvec{Q}_{k,\cdot }) = \text {Dir}( \varvec{Q}_{k, \cdot }; \varvec{\alpha }_Q ) \, \prod _{t : s_{t-\zeta _t} = k} (\varvec{Q})_{k, s_t} \), again a standard Dirichlet/multinomial update, now using transition counts. In the case of an SBM prior for the kth row of \(\varvec{Q}\), this becomes the standard SBM update given in Sect. 2.2, which draws updated latent stick-breaking weights and constructs the row of \(\varvec{Q}\).
This Gibbs sampler can be modified to include an update for \(\varvec{\pi }_k\) together with the update for the kth row of \(\varvec{Q}\) since the only component of the posterior in (15) dependent on \(\varvec{\pi }_k\) is \(p(\varvec{Q}_{k,\cdot })\). A Dirichlet prior for \(\varvec{\pi }_k\) results in a Dirichlet full conditional, using the counts of latent \(\xi \) variables drawn using (6). Updates for any other hyperparameters would require an alternate sampling scheme.
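One full Gibbs scan can be sketched as follows for the all-Dirichlet case; the SDM and SBM variants would replace the two conjugate Dirichlet draws with the updates from Sects. 2.1 and 2.2. This is illustrative code under our own conventions (0-indexed states, lag \(\ell \in \{1, \ldots , R\}\) stored as index \(\ell - 1\)), not the authors' implementation.

```python
import numpy as np

def mtd_gibbs_scan(s, R, K, alpha_lam, alpha_Q, lam, Q, rng):
    """One Gibbs scan for the basic MTD model with Dirichlet priors."""
    T = len(s)
    # 1. latent lags: P(zeta_t = l | ...) proportional to lam_l * Q[s_{t-l}, s_t]
    zeta = np.zeros(T, dtype=int)
    for t in range(R, T):
        w = np.array([lam[l] * Q[s[t - l - 1], s[t]] for l in range(R)])
        zeta[t] = rng.choice(R, p=w / w.sum())
    # 2. lag weights: conjugate Dirichlet/multinomial update from lag counts
    lam = rng.dirichlet(alpha_lam + np.bincount(zeta[R:], minlength=R))
    # 3. rows of Q: Dirichlet updates from the transition counts N
    N = np.zeros((K, K))
    for t in range(R, T):
        N[s[t - zeta[t] - 1], s[t]] += 1
    Q = np.vstack([rng.dirichlet(alpha_Q + N[k]) for k in range(K)])
    return lam, Q

rng = np.random.default_rng(0)
s = rng.integers(0, 3, size=200)          # toy sequence, K = 3 states
lam, Q = np.full(2, 0.5), np.full((3, 3), 1 / 3)
for _ in range(50):                       # R = 2 candidate lags
    lam, Q = mtd_gibbs_scan(s, 2, 3, np.ones(2), np.ones(3), lam, Q, rng)
```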
C.2 Modified algorithm
Sampling \(\{\zeta _t \}\) conditional on \(\varvec{Q}\), followed by \(\varvec{Q}\) conditional on \(\{\zeta _t\}\), results in a chain that tends to get stuck. To improve mixing, we instead integrate \(\varvec{Q}\) out of the joint posterior (15) and conduct Gibbs sampling between \(\{\zeta _t \}\) and \(\varvec{\lambda }\). At each iteration, it is then straightforward to draw \(\varvec{Q}\) from the conditional distribution given in C.1.
Let \(\varvec{N}\) be a \(K\times K\) matrix of transition counts for which the \((k_1, k_2)\) entry is the cardinality of \(\{ t : s_{t - \zeta _t} = k_1 \ \text {and} \ s_t = k_2 \}\). Integrating \(\varvec{Q}\) from (15) yields \( p(\{\zeta _t\}, \varvec{\lambda } \mid \{s_t\}) \propto p(\varvec{\lambda }) \, \prod _t [ p(\zeta _t \mid \varvec{\lambda }) \, p(s_t \mid \zeta _t) ] \), which differs from the original only in that the likelihood contribution is now

$$p(\{s_t\} \mid \{\zeta _t\}) = \prod _{k=1}^{K} h( \varvec{N}_{k,\cdot }, \phi ) , \qquad (16)$$

where \(h(\cdot , \phi )\) takes the form of (12) if the rows of \(\varvec{Q}\) are independent Dirichlet, (13) if the rows of \(\varvec{Q}\) are independent SDM, and (14) if the rows of \(\varvec{Q}\) are independent SBM; \(\phi \) refers to generic hyperparameters appropriate for the choice of prior.
The modified algorithm then proceeds with the standard update for \(\varvec{\lambda }\) given in C.1. Each \(\zeta _t \) is then updated individually using its full conditional involving \(\varvec{\lambda }\) and (16) by modifying \(\varvec{N}\) to reflect each possible value of \(\zeta _t \in \{1, \ldots , R\}\). Unnecessary computation can be avoided by noting redundancies in the denominators of (12) and (13), and computing (16) only over values of k such that \(s_{t-\ell } = k\) for \(\ell = 1, \ldots , R\).
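For independent Dirichlet rows, the single-site update of \(\zeta _t\) can be sketched as follows (our names and a Dirichlet/multinomial form for the h-factor; not the authors' code). Because switching \(\zeta _t\) alters only one row of \(\varvec{N}\), each candidate lag is scored by \(\lambda _\ell \) times the ratio of h-factors for the affected row, with all other factors cancelling.

```python
from math import lgamma, log, exp

def log_h(row, alpha):
    # Dirichlet marginal factor for one row of transition counts,
    # i.e. the Dirichlet/multinomial form of h(., phi)
    a0, n0 = sum(alpha), sum(row)
    return (sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, row))
            + lgamma(a0) - lgamma(a0 + n0))

def zeta_full_conditional(t, s, N, lam, alpha, R, cur_lag):
    # remove the transition currently attributed to time t ...
    N[s[t - cur_lag]][s[t]] -= 1
    logw = []
    for l in range(1, R + 1):
        k = s[t - l]                     # only row k of N changes under lag l
        base = log_h(N[k], alpha)
        N[k][s[t]] += 1
        logw.append(log(lam[l - 1]) + log_h(N[k], alpha) - base)
        N[k][s[t]] -= 1
    N[s[t - cur_lag]][s[t]] += 1         # ... and restore it
    m = max(logw)                        # normalize stably in log space
    w = [exp(x - m) for x in logw]
    return [x / sum(w) for x in w]

# s = (0, 1, 0, 1, 1) with lags (1, 2, 1) at t = 2, 3, 4 gives N = [[0,0],[1,2]];
# resample zeta_4 with lam = (0.6, 0.4) and a flat Dirichlet on each row
N = [[0, 0], [1, 2]]
p = zeta_full_conditional(4, [0, 1, 0, 1, 1], N, [0.6, 0.4], [1.0, 1.0], 2, 1)
```

For the Dirichlet case the h-ratio collapses to the familiar predictive probability \((\alpha _{s_t} + N_{k, s_t}) / (\alpha _0 + N_{k, \cdot })\), which is why both candidate lags score 0.5 here and `p` reduces to \(\varvec{\lambda }\).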
Allowing inference for \(\{\varvec{\pi }_k\}\) when utilizing SBM priors for \(\varvec{Q}\) is no more complicated than in the original algorithm. Simply draw \(\varvec{\pi }_k\) when sampling from the conditional for the kth row of \(\varvec{Q}\). Note however that integrating over \(\varvec{Q}\) leaves the update for \(\varvec{\lambda }\) conditional on \(\{\varvec{\pi }_k\}\). Nevertheless, the result is still a valid Gibbs sampler which in practice produces acceptable mixing.
Cite this article
Heiner, M., Kottas, A. & Munch, S. Structured priors for sparse probability vectors with application to model selection in Markov chains. Stat Comput 29, 1077–1093 (2019). https://doi.org/10.1007/s11222-019-09856-2