Abstract
Regression models with a large number of predictors arise in diverse fields of the social and natural sciences. For proper interpretation, we often wish to identify a smaller subset of the variables that carries the strongest information. With a large pool of candidate predictors, however, optimizing a variable selection criterion such as AIC, \(C_{P}\), or BIC over all possible subsets is computationally prohibitive in practice. In this paper, we present two efficient optimization algorithms via the Markov chain Monte Carlo (MCMC) approach for searching for the globally optimal subset. Simulated examples as well as one real data set show that our proposed MCMC algorithms find better solutions than other popular search methods in terms of minimizing a given criterion.
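As a purely illustrative companion to the abstract, the sketch below shows one way such a stochastic subset search can be organized in Python: at each step one variable is deleted from the current size-\(q\) subset and a replacement is drawn with probability proportional to a weight based on the residual sum of squares, while the best subset visited under a chosen criterion (BIC here) is recorded. The weight function, the temperature parameter, and the names `rss` and `mcmc_subset_search` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

def mcmc_subset_search(X, y, q, n_iter=2000, temperature=1.0, seed=0):
    """Stochastic search over size-q subsets; returns the best BIC subset visited."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def log_weight(subset):               # larger weight for smaller RSS (illustrative choice)
        return -n * np.log(rss(X, y, subset)) / (2.0 * temperature)

    def bic(subset):
        return n * np.log(rss(X, y, subset) / n) + len(subset) * np.log(n)

    current = list(rng.choice(p, size=q, replace=False))
    best, best_bic = list(current), bic(current)
    for _ in range(n_iter):
        drop = int(rng.integers(q))       # delete one variable uniformly at random
        base = [v for i, v in enumerate(current) if i != drop]
        candidates = [k for k in range(p) if k not in base]
        logw = np.array([log_weight(base + [k]) for k in candidates])
        prob = np.exp(logw - logw.max())
        prob /= prob.sum()                # re-add a variable with prob. proportional to its weight
        current = base + [candidates[rng.choice(len(candidates), p=prob)]]
        if bic(current) < best_bic:
            best, best_bic = list(current), bic(current)
    return sorted(best), best_bic

# Example usage on simulated data with two informative predictors.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 20))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(100)
    print(mcmc_subset_search(X, y, q=2))
```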
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Bakin S (1999) Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University, Canberra
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics 37(4):373–384
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35:2313–2351
Chiang A, Beck J, Yen HJ, Tayeh M, Scheetz T, Swiderski R, Nishimura D, Braun T, Kim KY, Huang J, Elbedour K, Carmi R, Slusarski D, Casavant T, Stone E, Sheffield V (2006) Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proc Natl Acad Sci 103(16):6287–6292
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Am Stat Assoc 88(423):881–889
George EI, McCulloch RE (1997) Approaches for Bayesian variable selection. Stat Sin 7(2):339–373
Huang J, Ma S, Zhang CH (2008) Adaptive lasso for sparse high-dimensional regression models. Stat Sin 18(4):1603–1618
Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249–264
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
Kohn R, Smith M, Chan D (2001) Nonparametric regression using linear combinations of basis functions. Stat Comput 11:313–322
Liang F, Paulo R, Molina G, Clyde MA, Berger JO (2008) Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc 103(481):410–423
Mallows C (1973) Some comments on \(C_{P}\). Technometrics 15(4):661–675
Miller A (2002) Subset selection in regression, 2nd edn. Chapman and Hall/CRC, Boca Raton
Muller P, Quintana FA (2004) Nonparametric Bayesian data analysis. Stat Sci 19(1):95–110
Rocha G, Zhao P (2006) Lasso Matlab codes. http://www.stat.berkeley.edu/twiki/Research/YuGroup/Software
Scheetz T, Kim KY, Swiderski R, Philp A, Braun T, Knudtson K, Dorrance A, DiBona G, Huang J, Casavant T, Sheffield V, Stone E (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103(39):14429–14434
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shiryaev A (1996) Probability, 2nd edn. Springer, New York
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. J Econom 75(2):317–343
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel PK, Zellner A (eds) Bayesian inference and decision techniques: essays in honor of Bruno de Finetti. North-Holland, Amsterdam, pp 233–243
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320
Appendices
Appendix 1: Proofs
Proof of Theorem 1
Assume \(Q_{q,i}\), \(Q_{q,j}\in \mathcal {Q}_{q}\). Let \(E_{i,j}=Q_{q,i}\cap Q_{q,j}\) and \(n_{i,j}=q-|E_{i,j}|\). According to the MCMC1 algorithm, we have
where \(\tilde{z}_{B}=\sum _{\kappa \notin B} f(\text {RSS}_{B\cup \{\kappa \}})\) for \(B\in \mathcal {Q}_{q-1}\). For convergence, by the Ergodic Theorem (see “Appendix 2”), it suffices to show that there exists an \(n_{0}\) such that \(p_{ij}^{(n_{0})}>0\) for all \(i, j\). First, consider the case \(j=i\),
For \(i \ne j\), let
and define
where \(Q^{*}_{q,j}(1:k)\) denotes the first k variables in \(Q^{*}_{q,j}\) and \(Q^{*}_{q,i}(k+1:n_{i,j})\) denotes the set \(Q^{*}_{q,i}\) excluding the first k variables. In this setup, for \(1\le k\le n_{i,j}\), \(q-|Q^{*(k-1)} \cap Q^{*(k)}|=1\), which leads to
Therefore,
where \(p^{(0)}_{\alpha j}:=1\) for all \(\alpha \). Since \(p_{ij}^{(q)}>0\) for all \(i, j\), the \(p_{ij}^{(m)}\)’s converge by the Ergodic Theorem. Next, we show that \(p_{ij}^{(m)}\) converges to \(f(\text {RSS}_{Q_{q,j}})/z\) for \(j=1,\ldots ,C^{p}_{q}\). For \(n_{i,j}\ne 1\), it is clear that
For \(n_{i,j}=1\),
By Remark 3 (see “Appendix 2”), we conclude that \(\pi _{j}=f(\text {RSS}_{Q_{q,j}})/z\) for \(j=1,\ldots ,C^{p}_{q}\). \(\square \)
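For readers who want to check Theorem 1 numerically, the following small Python script builds the full transition matrix of one plausible form of the MCMC1 kernel (delete one variable uniformly at random, then re-add a variable \(\kappa\) with probability \(f(\text {RSS}_{B\cup \{\kappa \}})/\tilde{z}_{B}\), matching the normalizer \(\tilde{z}_{B}\) defined above) on a toy data set and compares the rows of a high power of the matrix with \(f(\text {RSS}_{Q_{q,j}})/z\). The data, the weight function \(f\), and all variable names are illustrative assumptions, not taken from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 6, 3
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)   # toy data

def rss(subset):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Xs = X[:, list(subset)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

f = lambda r: np.exp(-r / 4.0)            # hypothetical weight function f(RSS)
states = list(itertools.combinations(range(p), q))
S = len(states)

# Assumed MCMC1 kernel: delete zeta uniformly (prob 1/q), then re-add kappa
# with probability f(RSS_{B u {kappa}}) / z~_B, where B = Q \ {zeta}.
P = np.zeros((S, S))
for i, Q in enumerate(states):
    for zeta in Q:
        B = tuple(v for v in Q if v != zeta)
        outside = [k for k in range(p) if k not in B]
        z_tilde = sum(f(rss(B + (k,))) for k in outside)
        for k in outside:
            j = states.index(tuple(sorted(B + (k,))))
            P[i, j] += (1.0 / q) * f(rss(B + (k,))) / z_tilde

pi_hat = np.linalg.matrix_power(P, 500)[0]          # p_{1j}^{(500)}
pi_thm = np.array([f(rss(Q)) for Q in states])
pi_thm /= pi_thm.sum()                              # Theorem 1: pi_j = f(RSS_{Q_j})/z
print(np.abs(pi_hat - pi_thm).max())                # close to zero
```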
Proof of Theorem 2
The proof is very similar to that of Theorem 1. Assume \(Q_{q,i}\), \(Q_{q,j}\in \mathcal {Q}_{q}\). Let \(E_{i,j}=Q_{q,i}\cap Q_{q,j}\) and \(n_{i,j}=q-|E_{i,j}|\). According to the MCMC2 algorithm, we have
where \(z_{A}=\sum _{\zeta \in A}f(\text {RSS}_{A\setminus \{\zeta \}})\) for \(A\in \mathcal {Q}_{q}\), and \(\tilde{z}_{B}=\sum _{\kappa \notin B} f(\text {RSS}_{B\cup \{\kappa \}})\) for \(B\in \mathcal {Q}_{q-1}\). To apply the Ergodic Theorem, we show in the following that \(p_{ij}^{(q)}>0\). For \(j=i\),
For \(i\ne j\), following the same arguments and notation as in the proof of Theorem 1, we have
Since \(p_{ij}^{(q)}>0\) for all \(i, j\), the \(p_{ij}^{(m)}\)’s converge by the Ergodic Theorem. Next, we show that \(p_{ij}^{(m)}\) converges to \(\frac{1}{z} \cdot \sum _{\zeta \in Q_{q,j}}f(\text {RSS}_{Q_{q,j}\setminus \{\zeta \}}) \cdot f(\text {RSS}_{Q_{q,j}})\) for \(j=1,\ldots ,C^{p}_{q}\). For \(n_{i,j}\ne 1\), it is clear that
For \(n_{i,j}=1\),
By Remark 3 (see “Appendix 2”), we have
\(\square \)
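A companion check of Theorem 2, continuing the toy example given after the proof of Theorem 1 (it reuses `X`, `y`, `rss`, `f`, `states`, `S`, `p`, and `q` from that sketch): here the deletion step is also weighted, removing \(\zeta\) with probability \(f(\text {RSS}_{A\setminus \{\zeta \}})/z_{A}\), and the limit is compared with the expression stated in the theorem. Again the kernel form is an assumption consistent with the normalizers \(z_{A}\) and \(\tilde{z}_{B}\) defined above.

```python
# Assumed MCMC2 kernel: delete zeta with probability f(RSS_{A\{zeta}})/z_A,
# then re-add kappa with probability f(RSS_{B u {kappa}})/z~_B.
P2 = np.zeros((S, S))
for i, Q in enumerate(states):
    z_A = sum(f(rss(tuple(v for v in Q if v != zeta))) for zeta in Q)
    for zeta in Q:
        B = tuple(v for v in Q if v != zeta)
        outside = [k for k in range(p) if k not in B]
        z_tilde = sum(f(rss(B + (k,))) for k in outside)
        for k in outside:
            j = states.index(tuple(sorted(B + (k,))))
            P2[i, j] += (f(rss(B)) / z_A) * (f(rss(B + (k,))) / z_tilde)

pi2_hat = np.linalg.matrix_power(P2, 500)[0]
# Theorem 2's limit: proportional to f(RSS_{Q_j}) * sum_{zeta} f(RSS_{Q_j \ {zeta}})
pi2_thm = np.array([f(rss(Q)) * sum(f(rss(tuple(v for v in Q if v != zeta)))
                                    for zeta in Q) for Q in states])
pi2_thm /= pi2_thm.sum()
print(np.abs(pi2_hat - pi2_thm).max())              # close to zero
```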
Appendix 2
Ergodic Theorem
Let \(\varGamma =[p_{ij}]\) be the transition matrix of a chain with a finite state space \(\mathbf S =\{1, 2, \ldots , S\}\). If there is an \(n_{0}\) such that
then there is a sequence of numbers \(\pi _{1}, \ldots , \pi _{S}\) such that
and
for every \(i\in \mathbf S \). Moreover, \(\pi _{1}, \ldots , \pi _{S}\) is the unique stationary probability distribution of the chain with transition matrix \(\varGamma \); that is, it is the only sequence that satisfies both (5) and the following equations
Proof
See Shiryaev (1996, pp. 118–120). \(\square \)
Remark 3
If there is a sequence of numbers \(\tilde{\pi }_{1}, \ldots , \tilde{\pi }_{S}\) which satisfies (5) and the equations
then \(\tilde{\pi }_{i}=\pi _{i}\) for \(i=1,\ldots ,S\).
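To illustrate the Ergodic Theorem and Remark 3 on a concrete case, the short script below takes a \(3\times 3\) transition matrix with all entries positive (so \(n_{0}=1\)), shows that every row of \(\varGamma ^{n}\) approaches the same vector, and recovers that vector as the unique probability distribution solving the stationarity equations. The matrix is an arbitrary illustrative choice.

```python
import numpy as np

# A 3x3 transition matrix with all entries positive, so n_0 = 1 in the theorem.
Gamma = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

# Every row of Gamma^n converges to the same stationary vector pi.
rows = np.linalg.matrix_power(Gamma, 50)
print(rows)

# Remark 3: pi is the unique probability vector solving pi = pi * Gamma.
# Solve pi (I - Gamma) = 0 together with sum(pi) = 1 as a least-squares system.
A = np.vstack([(np.eye(3) - Gamma).T, np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)          # matches each row of Gamma^50
```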