Abstract
Regression models with a large number of predictors arise in diverse fields of the social and natural sciences. For proper interpretation, we often wish to identify a smaller subset of the variables that carries the strongest information. With a large pool of candidate predictors, however, optimizing a variable selection criterion such as AIC, \(C_{P}\), or BIC over all possible subsets is computationally prohibitive in practice. In this paper, we present two efficient optimization algorithms via the Markov chain Monte Carlo (MCMC) approach for searching for the globally optimal subset. Simulated examples as well as one real data set show that our proposed MCMC algorithms find better solutions than other popular search methods in terms of minimizing a given criterion.
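As a purely illustrative companion to the abstract, the sketch below shows one way such a stochastic subset search can be organized in Python: at each step one variable is deleted from the current size-\(q\) subset and a replacement is drawn with probability proportional to a weight based on the residual sum of squares, while the best subset visited under a chosen criterion (BIC here) is recorded. The weight function, the temperature parameter, and the names `rss` and `mcmc_subset_search` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

def mcmc_subset_search(X, y, q, n_iter=2000, temperature=1.0, seed=0):
    """Stochastic search over size-q subsets; returns the best BIC subset visited."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def log_weight(subset):               # larger weight for smaller RSS (illustrative choice)
        return -n * np.log(rss(X, y, subset)) / (2.0 * temperature)

    def bic(subset):
        return n * np.log(rss(X, y, subset) / n) + len(subset) * np.log(n)

    current = list(rng.choice(p, size=q, replace=False))
    best, best_bic = list(current), bic(current)
    for _ in range(n_iter):
        drop = int(rng.integers(q))       # delete one variable uniformly at random
        base = [v for i, v in enumerate(current) if i != drop]
        candidates = [k for k in range(p) if k not in base]
        logw = np.array([log_weight(base + [k]) for k in candidates])
        prob = np.exp(logw - logw.max())
        prob /= prob.sum()                # re-add a variable with prob. proportional to its weight
        current = base + [candidates[rng.choice(len(candidates), p=prob)]]
        if bic(current) < best_bic:
            best, best_bic = list(current), bic(current)
    return sorted(best), best_bic

# Example usage on simulated data with two informative predictors.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 20))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(100)
    print(mcmc_subset_search(X, y, q=2))
```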
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Bakin S (1999) Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University, Canberra
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics 37(4):373–384
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35:2313–2351
Chiang A, Beck J, Yen HJ, Tayeh M, Scheetz T, Swiderski R, Nishimura D, Braun T, Kim KY, Huang J, Elbedour K, Carmi R, Slusarski D, Casavant T, Stone E, Sheffield V (2006) Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proc Natl Acad Sci 103(16):6287–6292
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Am Stat Assoc 88(423):881–889
George EI, McCulloch RE (1997) Approaches for Bayesian variable selection. Stat Sin 7(2):339–373
Huang J, Ma S, Zhang CH (2008) Adaptive lasso for sparse high-dimensional regression models. Stat Sin 18(4):1603–1618
Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249–264
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
Kohn R, Smith M, Chan D (2001) Nonparametric regression using linear combinations of basis functions. Stat Comput 11:313–322
Liang F, Paulo R, Molina G, Clyde MA, Berger JO (2008) Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc 103(481):410–423
Mallows C (1973) Some comments on \(C_{P}\). Technometrics 15(4):661–675
Miller A (2002) Subset selection in regression, 2nd edn. Chapman and Hall/CRC, Boca Raton
Muller P, Quintana FA (2004) Nonparametric Bayesian data analysis. Stat Sci 19(1):95–110
Rocha G, Zhao P (2006) Lasso Matlab codes. http://www.stat.berkeley.edu/twiki/Research/YuGroup/Software
Scheetz T, Kim KY, Swiderski R, Philp A, Braun T, Knudtson K, Dorrance A, DiBona G, Huang J, Casavant T, Sheffield V, Stone E (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103(39):14429–14434
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shiryaev A (1996) Probability, 2nd edn. Springer, New York
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. J Econom 75(2):317–343
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel PK, Zellner A (eds) Bayesian inference and decision techniques: essays in honor of Bruno de Finetti. North-Holland, Amsterdam, pp 233–243
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320
Appendices
Appendix 1: Proofs
Proof of Theorem 1
Assume \(Q_{q,i}\), \(Q_{q,j}\in \mathcal {Q}_{q}\). Let \(E_{i,j}=Q_{q,i}\cap Q_{q,j}\) and \(n_{i,j}=q-|E_{i,j}|\). According to the MCMC1 algorithm, we have
where \(\tilde{z}_{B}=\sum _{\kappa \notin B} f(\text {RSS}_{B\cup \{\kappa \}})\) for \(B\in \mathcal {Q}_{q-1}\). For convergence, by the Ergodic Theorem (see “Appendix 2”), it suffices to show that there exists an \(n_{0}\) such that \(p_{ij}^{(n_{0})}>0\) for all \(i, j\). First, consider the case \(j=i\),
For \(i \ne j\), let
and define
where \(Q^{*}_{q,j}(1:k)\) denotes the first k variables in \(Q^{*}_{q,j}\) and \(Q^{*}_{q,i}(k+1:n_{i,j})\) denotes the set \(Q^{*}_{q,i}\) excluding the first k variables. In this setup, for \(1\le k\le n_{i,j}\), \(q-|Q^{*(k-1)} \cap Q^{*(k)}|=1\), which leads to
Therefore,
where \(p^{(0)}_{\alpha j}:=1\) for all \(\alpha \). Since \(p_{ij}^{(q)}>0\) for all \(i, j\), the \(p_{ij}^{(m)}\)’s converge by the Ergodic Theorem. Next, we show that \(p_{ij}^{(m)}\) converges to \(f(\text {RSS}_{Q_{q,j}})/z\) for \(j=1,\ldots ,C^{p}_{q}\). For \(n_{i,j}\ne 1\), it is clear that
For \(n_{i,j}=1\),
By Remark 3 (see “Appendix 2”), we conclude that \(\pi _{j}=f(\text {RSS}_{Q_{q,j}})/z\) for \(j=1,\ldots ,C^{p}_{q}\). \(\square \)
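For readers who want to check Theorem 1 numerically, the following small Python script builds the full transition matrix of one plausible form of the MCMC1 kernel (delete one variable uniformly at random, then re-add a variable \(\kappa\) with probability \(f(\text {RSS}_{B\cup \{\kappa \}})/\tilde{z}_{B}\), matching the normalizer \(\tilde{z}_{B}\) defined above) on a toy data set and compares the rows of a high power of the matrix with \(f(\text {RSS}_{Q_{q,j}})/z\). The data, the weight function \(f\), and all variable names are illustrative assumptions, not taken from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 6, 3
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)   # toy data

def rss(subset):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Xs = X[:, list(subset)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

f = lambda r: np.exp(-r / 4.0)            # hypothetical weight function f(RSS)
states = list(itertools.combinations(range(p), q))
S = len(states)

# Assumed MCMC1 kernel: delete zeta uniformly (prob 1/q), then re-add kappa
# with probability f(RSS_{B u {kappa}}) / z~_B, where B = Q \ {zeta}.
P = np.zeros((S, S))
for i, Q in enumerate(states):
    for zeta in Q:
        B = tuple(v for v in Q if v != zeta)
        outside = [k for k in range(p) if k not in B]
        z_tilde = sum(f(rss(B + (k,))) for k in outside)
        for k in outside:
            j = states.index(tuple(sorted(B + (k,))))
            P[i, j] += (1.0 / q) * f(rss(B + (k,))) / z_tilde

pi_hat = np.linalg.matrix_power(P, 500)[0]          # p_{1j}^{(500)}
pi_thm = np.array([f(rss(Q)) for Q in states])
pi_thm /= pi_thm.sum()                              # Theorem 1: pi_j = f(RSS_{Q_j})/z
print(np.abs(pi_hat - pi_thm).max())                # close to zero
```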
Proof of Theorem 2
The proof is very similar to that of Theorem 1. Assume \(Q_{q,i}\), \(Q_{q,j}\in \mathcal {Q}_{q}\). Let \(E_{i,j}=Q_{q,i}\cap Q_{q,j}\) and \(n_{i,j}=q-|E_{i,j}|\). According to the MCMC2 algorithm, we have
where \(z_{A}=\sum _{\zeta \in A}f(\text {RSS}_{A\setminus \{\zeta \}})\) for \(A\in \mathcal {Q}_{q}\), and \(\tilde{z}_{B}=\sum _{\kappa \notin B} f(\text {RSS}_{B\cup \{\kappa \}})\) for \(B\in \mathcal {Q}_{q-1}\). To apply the Ergodic Theorem, we show in the following that \(p_{ij}^{(q)}>0\). For \(j=i\),
For \(i\ne j\), following the same arguments and notation as in the proof of Theorem 1, we have
Since \(p_{ij}^{(q)}>0\) for all \(i, j\), the \(p_{ij}^{(m)}\)’s converge by the Ergodic Theorem. Next, we show that \(p_{ij}^{(m)}\) converges to \(\frac{1}{z} \cdot \sum _{\zeta \in Q_{q,j}}f(\text {RSS}_{Q_{q,j}\setminus \{\zeta \}}) \cdot f(\text {RSS}_{Q_{q,j}})\) for \(j=1,\ldots ,C^{p}_{q}\). For \(n_{i,j}\ne 1\), it is clear that
For \(n_{i,j}=1\),
By Remark 3 (see “Appendix 2”), we have
\(\square \)
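A companion check of Theorem 2, continuing the toy example given after the proof of Theorem 1 (it reuses `X`, `y`, `rss`, `f`, `states`, `S`, `p`, and `q` from that sketch): here the deletion step is also weighted, removing \(\zeta\) with probability \(f(\text {RSS}_{A\setminus \{\zeta \}})/z_{A}\), and the limit is compared with the expression stated in the theorem. Again the kernel form is an assumption consistent with the normalizers \(z_{A}\) and \(\tilde{z}_{B}\) defined above.

```python
# Assumed MCMC2 kernel: delete zeta with probability f(RSS_{A\{zeta}})/z_A,
# then re-add kappa with probability f(RSS_{B u {kappa}})/z~_B.
P2 = np.zeros((S, S))
for i, Q in enumerate(states):
    z_A = sum(f(rss(tuple(v for v in Q if v != zeta))) for zeta in Q)
    for zeta in Q:
        B = tuple(v for v in Q if v != zeta)
        outside = [k for k in range(p) if k not in B]
        z_tilde = sum(f(rss(B + (k,))) for k in outside)
        for k in outside:
            j = states.index(tuple(sorted(B + (k,))))
            P2[i, j] += (f(rss(B)) / z_A) * (f(rss(B + (k,))) / z_tilde)

pi2_hat = np.linalg.matrix_power(P2, 500)[0]
# Theorem 2's limit: proportional to f(RSS_{Q_j}) * sum_{zeta} f(RSS_{Q_j \ {zeta}})
pi2_thm = np.array([f(rss(Q)) * sum(f(rss(tuple(v for v in Q if v != zeta)))
                                    for zeta in Q) for Q in states])
pi2_thm /= pi2_thm.sum()
print(np.abs(pi2_hat - pi2_thm).max())              # close to zero
```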
Appendix 2
Ergodic Theorem
Let \(\varGamma =[p_{ij}]\) be the transition matrix of a chain with a finite state space \(\mathbf S =\{1, 2, \ldots , S\}\). If there is an \(n_{0}\) such that
then there is a sequence of numbers \(\pi _{1}, \ldots , \pi _{S}\) such that
and
for every \(i\in \mathbf S \). Moreover, \(\pi _{1}, \ldots , \pi _{S}\) is the unique stationary probability distribution of the chain with transition matrix \(\varGamma \); that is, it is the only sequence that satisfies both (5) and the following equations
Proof
See Shiryaev (1996, pp. 118–120). \(\square \)
Remark 3
If there is a sequence of numbers \(\tilde{\pi }_{1}, \ldots , \tilde{\pi }_{S}\) which satisfies (5) and the equations
then \(\tilde{\pi }_{i}=\pi _{i}\) for \(i=1,\ldots ,S\).
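To illustrate the Ergodic Theorem and Remark 3 on a concrete case, the short script below takes a \(3\times 3\) transition matrix with all entries positive (so \(n_{0}=1\)), shows that every row of \(\varGamma ^{n}\) approaches the same vector, and recovers that vector as the unique probability distribution solving the stationarity equations. The matrix is an arbitrary illustrative choice.

```python
import numpy as np

# A 3x3 transition matrix with all entries positive, so n_0 = 1 in the theorem.
Gamma = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

# Every row of Gamma^n converges to the same stationary vector pi.
rows = np.linalg.matrix_power(Gamma, 50)
print(rows)

# Remark 3: pi is the unique probability vector solving pi = pi * Gamma.
# Solve pi (I - Gamma) = 0 together with sum(pi) = 1 as a least-squares system.
A = np.vstack([(np.eye(3) - Gamma).T, np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)          # matches each row of Gamma^50
```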