Variance reduction of estimators arising from Metropolis–Hastings algorithms

Abstract

The Metropolis–Hastings algorithm is one of the most basic and well-studied Markov chain Monte Carlo methods. It generates a Markov chain having the target distribution as its limiting distribution by simulating observations from a different proposal distribution. A proposed value is accepted with a particular probability; otherwise, the previous value is repeated. As a consequence, the accepted values are repeated a positive number of times, and thus any resulting ergodic mean is, in fact, a weighted average. It turns out that this weighted average is an importance sampling-type estimator with random weights. By the standard theory of importance sampling, replacing these random weights by their (conditional) expectations leads to more efficient estimators. In this paper we study the estimator arising from replacing the random weights with certain estimators of their conditional expectations. We illustrate by simulations that it is often more efficient than the original estimator, while in the case of the independence Metropolis–Hastings algorithm and for distributions with finite support we formally prove that it is even better than the “optimal” importance sampling estimator.
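
To make the weighted-average reading of Metropolis–Hastings output concrete, here is a minimal Python sketch (not taken from the paper). The target \(\pi\), the independence proposal \(q\) and the test function \(h\) are made-up toy values, and the conditional expectations of the weights are computed exactly, which is only possible for small finite supports; the paper instead studies estimators of these expectations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite target, independence proposal and test function (assumed values,
# for illustration only; none of these come from the paper).
pi = np.array([0.1, 0.2, 0.3, 0.4])      # target distribution
q  = np.array([0.4, 0.3, 0.2, 0.1])      # independence proposal
h  = np.arange(len(pi), dtype=float)     # test function h(x) = x, E_pi(h) = 2.0
m = len(pi)

def acc_prob(x, y):
    # Independence MH acceptance probability for a move x -> y.
    return min(1.0, pi[y] * q[x] / (pi[x] * q[y]))

# Probability that a proposal from x moves the chain away from x; the number of
# repetitions of x is geometric, so its conditional expectation is 1 / a[x].
a = np.array([sum(q[y] * acc_prob(x, y) for y in range(m) if y != x) for x in range(m)])

# Run the sampler, storing accepted values X_j and their repetition counts xi_j.
n_iter, x = 200_000, 0
X, xi = [x], [1]
for _ in range(n_iter):
    y = rng.choice(m, p=q)
    if y != x and rng.random() < acc_prob(x, y):
        X.append(y); xi.append(1); x = y
    else:
        xi[-1] += 1
X, xi = np.array(X), np.array(xi, dtype=float)

# The ordinary ergodic mean of the whole chain is a weighted average with the
# random weights xi_j ...
h_random_weights = np.sum(xi * h[X]) / np.sum(xi)
# ... while replacing xi_j by its conditional expectation 1 / a[X_j] gives a
# smoothed, importance-sampling-type estimator (here with exact expectations,
# whereas the paper uses estimators of them).
w = 1.0 / a[X]
h_expected_weights = np.sum(w * h[X]) / np.sum(w)

print(h_random_weights, h_expected_weights, np.dot(pi, h))   # all close to 2.0
```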

References

  • Atchadé, Y.F., Perron, F.: Improving on the independent Metropolis–Hastings algorithm. Stat. Sin. 15, 3–18 (2005)

  • Billingsley, P.: Statistical methods in Markov chains. Ann. Math. Stat. 32, 12–40 (1961)

  • Casella, G., Robert, C.: Rao–Blackwellisation of sampling schemes. Biometrika 83, 81–94 (1996)

  • Douc, R., Robert, C.P.: A vanilla Rao–Blackwellisation of Metropolis–Hastings algorithms. Ann. Stat. 39, 261–277 (2011)

  • Geyer, C.J.: Practical Markov chain Monte Carlo. Stat. Sci. 7, 473–483 (1992)

  • Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)

  • Jacob, P., Robert, C.P., Smith, M.: Using parallel computation to improve independent Metropolis–Hastings based estimation. J. Comput. Graph. Stat. 20, 616–635 (2011)

  • Malefaki, S., Iliopoulos, G.: On convergence of properly weighted samples to the target distribution. J. Stat. Plan. Inference 138, 1210–1225 (2008)

  • Marin, J.M., Robert, C.P.: Importance sampling methods for Bayesian discrimination between embedded models. In: Chen, M.-H., Dey, D.K., Müller, P., Sun, D., Ye, K. (eds.) Frontiers of Statistical Decision Making and Bayesian Analysis, Chap. 14, pp. 513–553 (2010)

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091 (1953)

Acknowledgements

The authors wish to thank the two anonymous referees for their helpful comments and suggestions which considerably improved the paper.

Author information

Correspondence to George Iliopoulos.

Appendix: Proof of Theorem 1

(a) Assume without loss of generality that the state space is \(\mathcal {X}=\{1, \ldots, m\}\) for some \(m\geq 2\) and set for convenience \(h_k=h(k)\), \(\pi_k=\pi(k)\), \(g_k=g(k)\), \(g_{kl}=g(l|k)\) and \(w_k=w(k)\). Then,

(5)

By the ergodic theorem it holds that \(n^{-1}\sum_{i=1}^{n}I(X_{i}=k)\stackrel{\mathrm{a.s.}}{\longrightarrow}g_{k}\) and \(\sum_{j=1}^{n}\frac{\xi_{j}}{\sum_{l=1}^{n}\xi_{l}}I(X_{j}=\ell)\stackrel{\mathrm{a.s.}}{\longrightarrow}\pi_{\ell}\). Since (5) contains only finite sums we conclude that it converges almost surely to \(\kappa\sum_{k=1}^{m}h_{k}\pi_{k}=\kappa E_{\pi}(h)\), because \(g_{k} = \kappa\sum_{\ell=1}^{m} \min\{\pi_{k} g_{k\ell}, \pi_{\ell}g_{\ell k}\}\). By taking \(h\equiv 1\) we get that \(n^{-1}\sum_{i=1}^{n}\hat{w}(X_{i})\stackrel{\mathrm{a.s.}}{\longrightarrow}\kappa\) and thus

$$\hat{h}_{\hat{w}} = \frac{\sum_{i=1}^n\hat{w}(X_i)h(X_i)/n}{\sum _{i=1}^n\hat{w}(X_i)/n} \stackrel{\mathrm{a.s.}}{\longrightarrow }E_{\pi}(h). $$
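
As an informal numerical check of the limits used above (not part of the paper), the following Python sketch builds the jump chain of accepted states for a toy three-state target \(\pi\) and proposal kernel \(Q\) (both assumptions), verifies that \(g\) is its stationary distribution, computes the constant denoted \(\kappa\), and confirms that reweighting the accepted-state frequencies by the expected repetition counts recovers \(\pi\).

```python
import numpy as np

# Toy three-state target and proposal kernel (assumed values, illustration only).
pi = np.array([0.2, 0.3, 0.5])
Q = np.array([[0.2, 0.5, 0.3],    # Q[x, y] = q(y | x); each row is a proposal law
              [0.4, 0.2, 0.4],
              [0.3, 0.6, 0.1]])
m = len(pi)

# MH acceptance probabilities and the per-state probability a(x) of leaving x.
alpha = np.minimum(1.0, (pi[None, :] * Q.T) / (pi[:, None] * Q))
a = np.array([sum(Q[x, y] * alpha[x, y] for y in range(m) if y != x) for x in range(m)])

# Jump chain of accepted (distinct) states: transition kernel G and stationary law g.
G = np.array([[Q[x, y] * alpha[x, y] / a[x] if y != x else 0.0 for y in range(m)]
              for x in range(m)])
g = pi * a / np.dot(pi, a)
print(np.allclose(g @ G, g))                    # True: g is stationary for the jump chain

# kappa is the limit of n^{-1} sum_i E(xi_i | X_i), and reweighting the jump-chain
# frequencies by the expected repetition counts 1/a recovers the target pi.
kappa = np.dot(g, 1.0 / a)
print(np.allclose(g * (1.0 / a) / kappa, pi))   # True
```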

(b) Let us define

$$\bar{U}_{k} = \frac{1}{n}\sum_{i=1}^{n}I(X_{i}={k}), \quad k=1,\ldots,m, $$
$$\bar{V}_{k} = \frac{1}{n}\sum_{i=1}^{n}\xi_{i}I(X_{i}={k}), \quad k=1,\ldots,m, $$

and

$$\rho_{kl} = \min \biggl\{\frac{q_{lk}}{\pi_{k}},\frac{q_{kl}}{\pi_{l}} \biggr\}, \quad k,l=1,\ldots,m. $$

Billingsley (1961) proved that

$$n^{1/2}(\bar{\boldsymbol{U}}-\boldsymbol{g})\rightarrow\mathcal {N}_m(\boldsymbol{0}, \boldsymbol{\varSigma }_{11}) $$

where \(\bar{\boldsymbol{U}}= (\bar{U}_{1}, \ldots, \bar{U}_{m})^{T}\), \(\boldsymbol{g}=(g_{1},\ldots,g_{m})^{T}\) and the \(ij\) entry of \(\boldsymbol{\varSigma}_{11}\) is

$$\sigma_{11}(ij) = \delta_{ij}g_{i}-g_{i}g_{j}+\sum_{n=1}^{\infty}\bigl\{g_{i}\bigl(g_{ij}^{(n)}-g_{j}\bigr)+g_{j}\bigl(g_{ji}^{(n)}-g_{i}\bigr)\bigr\}. $$

Here \(g_{ij}^{(n)}\) denotes the \(n\)-step transition probability from state \(i\) to state \(j\) and \(\delta_{ij}\) is Kronecker's delta. Using similar arguments it can be proven that

$$n^{1/2}\begin{pmatrix}\bar{\boldsymbol{U}}-\boldsymbol{g}\\ \bar{\boldsymbol{V}}-\kappa\boldsymbol{\pi}\end{pmatrix}\rightarrow\mathcal{N}_{2m}\left(\boldsymbol{0}, \begin{pmatrix}\boldsymbol{\varSigma}_{11}&\boldsymbol{\varSigma}_{12}\\ \boldsymbol{\varSigma}_{12}^{T}&\boldsymbol{\varSigma}_{22}\end{pmatrix}\right) $$

where \(\bar{\boldsymbol{V}}=(\bar{V}_{1},\ldots,\bar{V}_{m})^{T}\), the \(ij\) entry of the submatrix \(\boldsymbol{\varSigma}_{12}\) is \(\sigma_{12}(ij)=\kappa w_{j}\sigma_{11}(ij)\), while the \(ij\) entry of the submatrix \(\boldsymbol{\varSigma}_{22}\) is \(\sigma_{22}(ij)=\delta_{ij}g_{i}\kappa w_{i}(\kappa w_{i}-1)+\kappa^{2}w_{i}w_{j}\sigma_{11}(ij)\).
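
The entrywise series for \(\sigma_{11}(ij)\) can be evaluated numerically. The sketch below (not from the paper) truncates the series for a toy jump chain and cross-checks it against the equivalent fundamental-matrix expression \(D_gZ+Z^TD_g-D_g-\boldsymbol{g}\boldsymbol{g}^T\) with \(Z=(I-G+\boldsymbol{1}\boldsymbol{g}^T)^{-1}\); the transition matrix and the truncation length are arbitrary choices.

```python
import numpy as np

def sigma11_series(G, g, N=5000):
    # Entrywise series for sigma_11(ij), truncated after N steps.
    S = np.diag(g) - np.outer(g, g)
    Gn = np.eye(len(g))
    for _ in range(N):
        Gn = Gn @ G                                   # n-step probabilities g_ij^(n)
        S += np.diag(g) @ (Gn - g[None, :]) + (Gn - g[None, :]).T @ np.diag(g)
    return S

def sigma11_fundamental(G, g):
    # Equivalent closed form D_g Z + Z^T D_g - D_g - g g^T, Z = (I - G + 1 g^T)^{-1}.
    Dg = np.diag(g)
    Z = np.linalg.inv(np.eye(len(g)) - G + np.outer(np.ones(len(g)), g))
    return Dg @ Z + Z.T @ Dg - Dg - np.outer(g, g)

# Toy ergodic jump chain (assumed values) and its stationary distribution.
G = np.array([[0.0, 0.6, 0.4],
              [0.3, 0.0, 0.7],
              [0.5, 0.5, 0.0]])
vals, vecs = np.linalg.eig(G.T)
g = np.real(vecs[:, np.argmax(np.real(vals))])
g /= g.sum()

print(np.allclose(sigma11_series(G, g), sigma11_fundamental(G, g)))  # True
```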

It is clear that the asymptotic distribution of \(n^{1/2}\{\hat {h}_{\hat {w}}-E_{\pi}(h)\}\) can be found via the standard delta method. Observe that \(\hat{h}_{\hat{w}}\) can be expressed as

$$ \hat{h}_{\hat{w}} = \sum_{k=1}^{m}\frac{h_{k}\bar{U}_{k}}{\sum_{l=1}^{m}\rho_{kl}\kappa^{-1}\bar{V}_{l}} \bigg/ \sum_{k=1}^{m}\frac{\bar{U}_{k}}{\sum_{l=1}^{m}\rho_{kl}\kappa^{-1}\bar{V}_{l}} = f_1(\bar{\boldsymbol{U}}, \bar{\boldsymbol{V}}), $$
(6)

say. By differentiation we get

$$\frac{\partial}{\partial u_i}f_1(\boldsymbol{u}, \boldsymbol{v}) \bigg|_{(\boldsymbol{u}, \boldsymbol{v})=(\boldsymbol{g}, \kappa\boldsymbol{\pi})} = w_i\bigl\{h_i- E_{\pi}(h)\bigr\} $$

and

$$\frac{\partial}{\partial v_i}f_1(\boldsymbol{u}, \boldsymbol{v}) \bigg|_{(\boldsymbol{u}, \boldsymbol{v})=(\boldsymbol{g}, \kappa\boldsymbol{\pi})} = -\frac{1}{\kappa w_i}\sum_{j=1}^{m}g_{ij}w_j\bigl\{h_j- E_{\pi}(h)\bigr\}. $$

The variance of the asymptotic normal distribution of \(n^{1/2}\{\hat{h}_{\hat{w}}-E_{\pi}(h)\}\) is

$$\sigma^{2}_{\hat{w}}(h) = \nabla_{\boldsymbol{u}}f_1^{T} \boldsymbol{\varSigma}_{11} \nabla_{\boldsymbol{u}}f_1 + 2\nabla_{\boldsymbol{u}}f_1^{T} \boldsymbol{\varSigma}_{12} \nabla_{\boldsymbol{v}}f_1 + \nabla_{\boldsymbol{v}}f_1^{T} \boldsymbol{\varSigma}_{22} \nabla_{\boldsymbol{v}}f_1, $$

where \(\nabla_{\boldsymbol{u}}f_1\), \(\nabla_{\boldsymbol{v}}f_1\) are the vectors containing the derivatives with respect to \(\boldsymbol{u}\), \(\boldsymbol{v}\), respectively, evaluated at \((\boldsymbol{u},\boldsymbol{v})=(\boldsymbol{g}, \kappa\boldsymbol{\pi})\). We will see below that the above expression is in fact as stated in the theorem. To this end, let us also consider the IS “estimator” \(\hat{h}_{IS}\) and evaluate its asymptotic variance too. After some algebra we get that

$$ \hat{h}_{IS} = \sum_{k=1}^{m}\frac{h_{k}\bar{U}_{k}}{\sum _{l=1}^{m}\rho _{kl}\pi_{l}} \bigg/ \sum_{k=1}^{m}\frac{\bar{U}_{k}}{\sum _{l=1}^{m}\rho _{kl}\pi_{l}} = f_2(\bar{\boldsymbol{U}}), $$
(7)

say. By (6) and (7) we see that the only difference between the two “estimators” is the replacement of \(\pi_{l}\) by its unbiased estimate \(\kappa^{-1}\bar{V}_{l}\) in \(\hat{h}_{\hat{w}}\). By differentiation we get

$$\frac{\partial}{\partial u_i}f_2(\boldsymbol{u}) \bigg|_{\boldsymbol{u}=\boldsymbol{g}} = w_i\bigl\{h_i- E_{\pi}(h)\bigr\}. $$

Since the corresponding variance of the importance sampling estimator is

$$\sigma^2_{IS}(h) = \nabla_{\boldsymbol{u}}f_2^{T} \boldsymbol{ \varSigma }_{11} \nabla_{\boldsymbol{u}}f_2 \equiv \nabla_{\boldsymbol{u}}f_1^{T} \boldsymbol{ \varSigma }_{11} \nabla_{\boldsymbol{u}}f_1, $$

it is clear that

$$\sigma^{2}_{IS}(h)-\sigma^{2}_{\hat{w}}(h) = -2\nabla_{\boldsymbol{u}}f_1^{T} \boldsymbol{\varSigma}_{12} \nabla_{\boldsymbol{v}}f_1 - \nabla_{\boldsymbol{v}}f_1^{T} \boldsymbol{\varSigma}_{22} \nabla_{\boldsymbol{v}}f_1. $$

Set now \(A_{1} = - 2\nabla_{\boldsymbol{u}}f_{1}^{T} \boldsymbol{\varSigma}_{12} \nabla_{\boldsymbol{v}}f_{1}\), \(A_{2} = -\nabla_{\boldsymbol{v}}f_{1}^{T} \boldsymbol{\varSigma}_{22} \nabla_{\boldsymbol{v}}f_{1}\), \(B_n=w(X_n)h(X_n)\) for \(n=0,1,\ldots\), and \(b_i=w_ih_i\), \(\mu_i=E(B_1|X_0=i)\) for \(i=1,\ldots,m\). Consider further without loss of generality that \(E_{\pi}(h)=0\) and note that under stationarity it holds that \(E_{\pi}(h)=E_{g}(B_{n})\) for all \(n\). Then,

$$A_1 = 2\sum_{i=1}^{m}\sum_{j=1}^{m}b_i\mu_j\sigma_{11}(ij) = 2 \Biggl\{\sum_{i=1}^{m}b_i\mu_ig_i - \sum_{i=1}^{m}b_ig_i\sum_{j=1}^{m}\mu_jg_j + \sum_{n=1}^{\infty}\sum_{i=1}^{m}\sum_{j=1}^{m}b_i\mu_j\bigl(g_ig_{ij}^{(n)}+g_jg_{ji}^{(n)}-2g_ig_j\bigr) \Biggr\}. $$

But

$$\sum_{i=1}^m \mu_i g_i = \sum_{i=1}^m E(B_1|X_0=i) g_i = E_g\bigl\{ E(B_1|X_0)\bigr\} = E_g(B_1) = 0 $$

and since X is reversible, i.e., it holds that \(g_ig_{ij}=g_jg_{ji}\) and more generally \(g_{i} g_{ij}^{(n)} = g_{j} g_{ji}^{(n)}\) for all \(n\), we conclude that

$$A_1 = 2 \Biggl\{\sum_{i=1}^m b_i \mu_i g_i + 2 \sum _{n=1}^{\infty }\sum_{i=1}^m \sum_{j=1}^mb_i g_i \mu_j g_{ij}^{(n)} \Biggr\}. $$

Moreover,

$$\sum_{i=1}^m b_i \mu_i g_i = \sum_{i=1}^m b_i \Biggl(\sum_{k=1}^m b_kg_{ik} \Biggr) g_i = E_g(B_0B_1) $$

and for all n, we have that

$$\sum_{i=1}^m\sum_{j=1}^m b_ig_i \mu_j g_{ij}^{(n)} = \sum_{i=1}^m \sum_{j=1}^m b_ig_i \Biggl(\sum_{k=1}^m b_kg_{jk} \Biggr) g_{ij}^{(n)} = \sum_{i=1}^m \sum_{k=1}^m b_ib_kg_i \sum_{j=1}^m g_{ij}^{(n)} g_{jk} = \sum_{i=1}^m \sum_{k=1}^m b_ib_kg_i g_{ik}^{(n+1)} = E_g(B_0B_{n+1}). $$

Thus,

$$ A_1 = 2E_g(B_0 B_1) + 4 \sum_{n=1}^{\infty} E_g (B_0 B_{n+1}) = 2 \sum_{n=1}^{\infty} E_g (B_0 B_{n}) + 2 \sum _{n=1}^{\infty} E_g (B_0 B_{n+1}). $$
(8)

On the other hand,

$$A_{2} = -\nabla_{\boldsymbol{v}}f_{1}^{T} \boldsymbol{\varSigma}_{22} \nabla_{\boldsymbol{v}}f_{1} = \sum_{i=1}^{m}\frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}} - 2\sum_{i=1}^{m}\mu_{i}^{2}g_{i} - 2\sum_{n=1}^{\infty}\sum_{i=1}^{m}\sum_{j=1}^{m}\mu_{i}\mu_{j}g_{i}g_{ij}^{(n)}. $$

But

$$\sum_{i=1}^{m}\mu_{i}^{2}g_{i} = \sum_{i=1}^{m}\mu_{i} \Biggl(\sum_{k=1}^{m} b_kg_{ik} \Biggr) g_i = E_{g}(B_{0}B_{2}) $$

and for all n,

$$\sum_{i=1}^{m}\sum_{j=1}^{m}\mu_{i}\mu_{j}g_{i}g_{ij}^{(n)} = \sum_{i=1}^{m}\sum_{j=1}^{m}\mu_{i}g_{i} \Biggl(\sum_{k=1}^{m}b_kg_{jk} \Biggr) g_{ij}^{(n)} = E_{g}(B_{0}B_{n+2}). $$

Hence,

$$ A_{2} = \sum_{i=1}^{m} \frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}}-2E_{g}(B_{0}B_{2})-2\sum _{n=1}^{\infty}E_{g}(B_{0}B_{n+2}) = \sum_{i=1}^{m}\frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}}-2\sum _{n=1}^{\infty}E_{g}(B_{0}B_{n+1}). $$
(9)

From (8) and (9) we get

$$ \sigma^{2}_{IS}(h)-\sigma^{2}_{\hat{w}}(h) = A_{1}+A_{2} = \sum_{i=1}^{m} \frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}} + 2\sum_{n=1}^{\infty }E_{g}(B_{0}B_{n}). $$
(10)

By the standard asymptotic theory for Markov chains \(\sigma ^{2}_{IS}(h)=E_{g}(B_{0}^{2})+2\sum_{n=1}^{\infty}E_{g}(B_{0}B_{n})\). Since \(E_{g}(B_{0}^{2})=\operatorname{Var}_{g}\{w(X_{0})h(X_{0})\}\) part (b) of the theorem follows.
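
As an informal numerical check of this last identity (not part of the paper's argument), the sketch below builds a toy finite-state Metropolis–Hastings jump chain, forms \(b_i=w_i\{h_i-E_{\pi}(h)\}\), and verifies that the truncated autocovariance series \(E_g(B_0^2)+2\sum_{n\geq1}E_g(B_0B_n)\) agrees with the delta-method quadratic form \(\boldsymbol{b}^T\boldsymbol{\varSigma}_{11}\boldsymbol{b}\); the target, proposal and test function are made-up values.

```python
import numpy as np

# Toy target, proposal and test function (assumed values, illustration only).
pi = np.array([0.2, 0.3, 0.5])
Q = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.6, 0.1]])
h = np.array([1.0, -2.0, 4.0])
m = len(pi)

# Jump chain G, stationary law g, weights w = pi/g, and b_i = w_i {h_i - E_pi(h)}.
alpha = np.minimum(1.0, (pi[None, :] * Q.T) / (pi[:, None] * Q))
a = np.array([sum(Q[x, y] * alpha[x, y] for y in range(m) if y != x) for x in range(m)])
G = np.array([[Q[x, y] * alpha[x, y] / a[x] if y != x else 0.0 for y in range(m)]
              for x in range(m)])
g = pi * a / np.dot(pi, a)
w = pi / g
b = w * (h - np.dot(pi, h))

# Autocovariance series E_g(B_0^2) + 2 sum_{n>=1} E_g(B_0 B_n), truncated.
var_series = np.dot(g, b ** 2)
Gn = np.eye(m)
for _ in range(5000):
    Gn = Gn @ G
    var_series += 2 * np.einsum('i,ij,j->', g * b, Gn - g[None, :], b)

# Delta-method quadratic form, with Sigma_11 in its fundamental-matrix form.
Dg = np.diag(g)
Z = np.linalg.inv(np.eye(m) - G + np.outer(np.ones(m), g))
Sigma11 = Dg @ Z + Z.T @ Dg - Dg - np.outer(g, g)
print(np.isclose(var_series, b @ Sigma11 @ b))    # True
```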

(c) In order to compare the two variances note first that the first term of the sum in (10) is clearly nonnegative. Moreover, we know that \(\sum_{n=2}^{\infty}E_{g}(B_{0}B_{n}) \equiv\sum_{n=2}^{\infty}\mathrm{Cov}_{g}(B_{0},B_{n}) \geq 0\), since each residual sum of autocovariances starting from an even lag is nonnegative (see for example Geyer 1992). Thus, a sufficient condition for the difference of the variances in (10) to be nonnegative is \(E_{g}(B_{0}B_{1})\geq 0\). This expectation can be expressed as a quadratic form, namely,

$$E_{g}(B_{0}B_{1}) = \sum _{i=1}^{m}\sum_{j=1}^{m}b_{i}b_{j} \tilde {g}_{ij} = \boldsymbol{b}^{T}\tilde{\boldsymbol{G}} \boldsymbol{b} $$

where \(\tilde{g}_{ij}=g_{i}g_{ij}\). We will show that in the case of independence MH, the matrix \(\tilde{\boldsymbol{G}}\) is nonnegative definite. Indeed, in this case,

$$\tilde{g}_{ij} = g_ig_{ij} = \kappa^{-1}\min\{\pi_iq_j, \pi_jq_i\}, \quad i,j=1,\ldots,m. $$

Consider without loss of generality that

$$ q_{1}/\pi_{1}\leq\cdots\leq q_{m}/\pi_{m} $$
(11)

and set \(\gamma_{ij}=\kappa^{-1}q_{i}\pi_{j}\). Then clearly \(\tilde{g}_{ij}=\gamma_{i\wedge j,i\vee j}\), where \(i\wedge j=\min\{i,j\}\) and \(i\vee j=\max\{i,j\}\), so \(\tilde{\boldsymbol{G}}\) is nonnegative definite. To see this, proceed by induction to show that all its principal minors are nonnegative. Let \(\tilde{\boldsymbol{G}}_{kk}\) denote the matrix consisting of the first \(k\) rows and columns of \(\tilde{\boldsymbol{G}}\). Then, \(|\tilde{\boldsymbol{G}}_{11}|=\gamma_{11}=\kappa^{-1}q_{1}\pi_{1}>0\). Suppose now that \(|\tilde{\boldsymbol{G}}_{kk}|\geq 0\) for some \(k\geq 1\). In order to prove that \(|\tilde{\boldsymbol{G}}_{k+1,k+1}|\geq 0\), multiply the \(k\)-th row of \(\tilde{\boldsymbol{G}}_{k+1,k+1}\) by \(\pi_{k+1}/\pi_{k}\) and subtract it from the \((k+1)\)-th row. The resulting matrix has all elements of its last row equal to zero except for the last one, which is \(\kappa^{-1}q_{k+1}\pi_{k+1}-\kappa^{-1}q_{k}\pi_{k+1}^{2}/\pi_{k}=\kappa^{-1}\pi_{k+1}^{2}(q_{k+1}/\pi_{k+1}-q_{k}/\pi_{k})\). By (11) this quantity is nonnegative and thus \(|\tilde{\boldsymbol{G}}_{k+1,k+1}|= \kappa^{-1}\pi_{k+1}^{2}(q_{k+1}/\pi_{k+1}- q_{k}/\pi_{k})|\tilde{\boldsymbol{G}}_{kk}|\geq 0\).
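
The nonnegative definiteness argument is easy to check numerically. The sketch below (an illustration, not part of the proof) draws arbitrary \(\pi\) and \(q\), reorders the states so that (11) holds, forms the matrix with entries \(q_{i\wedge j}\pi_{i\vee j}\) (dropping the positive constant factor, which does not affect definiteness), and confirms both that its eigenvalues are nonnegative and that the leading-minor recursion used above holds.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
pi = rng.dirichlet(np.ones(m))     # toy target (assumption)
q = rng.dirichlet(np.ones(m))      # toy independence proposal (assumption)

# Reorder the states so that q_1/pi_1 <= ... <= q_m/pi_m, as in (11).
order = np.argsort(q / pi)
pi, q = pi[order], q[order]

# Matrix with entries q_{min(i,j)} * pi_{max(i,j)}; this is G_tilde up to the
# positive constant.
idx = np.arange(m)
M = q[np.minimum.outer(idx, idx)] * pi[np.maximum.outer(idx, idx)]

print(np.all(np.linalg.eigvalsh(M) >= -1e-12))    # True: nonnegative definite

# Leading-principal-minor recursion used in the induction above.
for k in range(1, m):
    lhs = np.linalg.det(M[:k + 1, :k + 1])
    rhs = pi[k] ** 2 * (q[k] / pi[k] - q[k - 1] / pi[k - 1]) * np.linalg.det(M[:k, :k])
    assert np.isclose(lhs, rhs)
print("minor recursion verified")
```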

Cite this article

Iliopoulos, G., Malefaki, S. Variance reduction of estimators arising from Metropolis–Hastings algorithms. Stat Comput 23, 577–587 (2013). https://doi.org/10.1007/s11222-012-9331-y
