General multilevel adaptations for stochastic approximation algorithms of Robbins–Monro and Polyak–Ruppert type

Numerische Mathematik

Abstract

In this article we present and analyse new multilevel adaptations of classical stochastic approximation algorithms for the computation of a zero of a function \(f:D \rightarrow {{\mathbb {R}}}^d\) defined on a convex domain \(D\subset {{\mathbb {R}}}^d\), which is given as a parameterised family of expectations. The analysis of the error and the computational cost of our method is based on assumptions similar to those used in Giles (Oper Res 56(3):607–617, 2008) for the computation of a single expectation. Additionally, we essentially only require that \(f\) satisfies a classical contraction property from stochastic approximation theory. Under these assumptions we establish error bounds in \(p\)th mean for our multilevel Robbins–Monro and Polyak–Ruppert schemes that decay in the computational time as fast as the classical error bounds for multilevel Monte Carlo approximations of single expectations known from Giles (Oper Res 56(3):607–617, 2008). Our approach is universal in the sense that, given multilevel implementations for a particular application, it is straightforward to implement the corresponding stochastic approximation algorithm.

References

  1. Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Volume 22 of Applications of Mathematics (New York), vol. 22. Springer, Berlin (1990)

  2. Duflo, M.: Algorithmes Stochastiques. Volume 23 of Mathématiques & Applications (Berlin) [Mathematics & Applications], vol. 23. Springer, Berlin (1996)

  3. Frikha, N.: Multi-level stochastic approximation algorithms. Ann. Appl. Probab. 26, 933–985 (2016)

  4. Gaposhkin, V.F., Krasulina, T.P.: On the law of the iterated logarithm in stochastic approximation processes. Theory Probab. Appl. 19(4), 844–850 (1974)

  5. Giles, M.B.: Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)

  6. Heinrich, S.: Multilevel Monte Carlo methods. In: Margenov, S., Waśniewski, J., Yalamov, P. (eds.) Large-Scale Scientific Computing, pp. 58–67. Springer, Berlin (2001)

  7. Kushner, H.J., Yang, J.: Stochastic approximation with averaging of the iterates: optimal asymptotic rate of convergence for general processes. SIAM J. Control Optim. 31(4), 1045–1062 (1993)

  8. Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, Volume 35 of Applications of Mathematics (New York). Stochastic Modelling and Applied Probability, 2nd edn. Springer, New York (2003)

  9. Lai, T.L.: Stochastic approximation. Ann. Stat. 31(2), 391–406 (2003). Dedicated to the memory of Herbert E. Robbins

  10. Lai, T.L., Robbins, H.: Limit theorems for weighted sums and stochastic approximation processes. Proc. Nat. Acad. Sci. U.S.A. 75, 1068–1070 (1978)

  11. Le Breton, A., Novikov, A.: Some results about averaging in stochastic approximation. Metrika 42(3–4), 153–171 (1995). Second International Conference on Mathematical Statistics (Smolenice Castle, 1994)

  12. Ljung, L., Pflug, G., Walk, H.: Stochastic Approximation and Optimization of Random Systems. Volume 17 of DMV Seminar, vol. 17. Birkhäuser Verlag, Basel (1992)

  13. Nualart, D.: The Malliavin Calculus and Related Topics. Probability and Its Applications (New York), 2nd edn. Springer, Berlin (2006)

  14. Pelletier, M.: On the almost sure asymptotic behaviour of stochastic algorithms. Stoch. Process. Appl. 78(2), 217–244 (1998)

  15. Pelletier, M.: Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Ann. Appl. Probab. 8(1), 10–44 (1998)

  16. Polyak, B.T.: A new method of stochastic approximation type. Avtomat. i Telemekh. 51(7), 937–1008 (1998)

  17. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  18. Ruppert, D.: Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Ann. Probab. 10, 178–187 (1982)

  19. Ruppert, D.: Stochastic Approximation. In: Ghosh, B.K., Sen, P.K. (eds.) Handbook of Sequential Analysis. Volume 118 of Statist. Textbooks Monogr., pp. 503–529. Dekker, New York (1991)

Acknowledgements

We thank two anonymous referees for their valuable comments, which improved the presentation of the material.

Author information

Correspondence to Thomas Müller-Gronbach.

Appendix

Let \((\Omega ,{\mathcal {F}},{P})\) be a probability space endowed with a filtration \(({\mathcal {F}}_n)_{n\in {{\mathbb {N}}}_0}\) and let \(\Vert \cdot \Vert \) denote a Hilbert space norm on \({{\mathbb {R}}}^d\).

In this section we provide pth mean estimates for an adapted d-dimensional dynamical system \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) with the property that for each \(n\in {{\mathbb {N}}}\), \(\zeta _n\) is a zero-mean perturbation of a previsible proposal \(\xi _n\) being comparable in size to \(\zeta _{n-1}\). More formally, we assume that there exist a previsible d-dimensional process \((\xi _n)_{n\in {{\mathbb {N}}}}\), a d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\) with \(M_0=\zeta _0\) and a constant \(c\ge 0\) such that for all \(n\in {{\mathbb {N}}}\)

$$\begin{aligned} \zeta _{n}&= \xi _{n}+\Delta M_n,\\ \Vert \xi _n\Vert&\le \Vert \zeta _{n-1}\Vert \vee c, \end{aligned} \tag{106}$$

where \(\Delta M_n = M_n-M_{n-1}\). Note that necessarily \(\xi _n={{\mathbb {E}}}[\zeta _n|{\mathcal {F}}_{n-1}]\).
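The identity can be seen by conditioning the first line of (106) on \({\mathcal {F}}_{n-1}\). The following short derivation is a sketch that assumes \(\zeta _n\) is integrable; it uses only that \(\xi _n\) is \({\mathcal {F}}_{n-1}\)-measurable (previsibility) and that \(M\) is a martingale:

$$\begin{aligned} {{\mathbb {E}}}[\zeta _n|{\mathcal {F}}_{n-1}]&= {{\mathbb {E}}}[\xi _n|{\mathcal {F}}_{n-1}] + {{\mathbb {E}}}[\Delta M_n|{\mathcal {F}}_{n-1}] \\&= \xi _n + 0 = \xi _n. \end{aligned}$$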

Theorem 5.1

Assume that \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) is an adapted d-dimensional process, which satisfies (106), and let \(p\in [1,\infty )\). Then there exists a constant \(\kappa \in (0,\infty )\), which only depends on p, such that for every \(n\in {{\mathbb {N}}}_0\),

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le k\le n}\Vert \zeta _k\Vert ^p\right] \le \kappa \, \bigl ( {{\mathbb {E}}}\bigl [ [M]_n^{p/2}\bigr ] + c^p\bigr ), \end{aligned}$$

where

$$\begin{aligned} {}[M]_n=\sum _{k=1}^{n} \Vert \Delta M_k\Vert ^2 +\Vert M_0\Vert ^2. \end{aligned}$$
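To get a feeling for the two sides of this estimate, one can simulate a simple process of the form (106). The sketch below is only an illustration and not part of the article: it assumes \(c=0\), i.i.d. standard Gaussian martingale increments and the linear proposal \(\xi _n=\rho \,\zeta _{n-1}\) with \(|\rho |\le 1\); the function name `simulate_ratio` and all parameters are hypothetical. It returns a Monte Carlo estimate of the ratio of the left-hand side to \({{\mathbb {E}}}[[M]_n^{p/2}]\).

```python
import math
import random


def simulate_ratio(n_steps=50, n_paths=2000, d=2, rho=0.9, p=2.0, seed=0):
    """Monte Carlo estimate of E[max_k |zeta_k|^p] / E[[M]_n^(p/2)] for c = 0.

    Illustrative assumptions: Gaussian increments, proposal xi_n = rho * zeta_{n-1}.
    """
    rng = random.Random(seed)
    lhs_sum = 0.0
    rhs_sum = 0.0
    for _ in range(n_paths):
        # zeta_0 = M_0: a centred Gaussian starting value.
        zeta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        bracket = sum(z * z for z in zeta)  # |M_0|^2
        running_max = math.sqrt(bracket)
        for _ in range(n_steps):
            # Martingale increment Delta M_n, independent of the past.
            d_m = [rng.gauss(0.0, 1.0) for _ in range(d)]
            # zeta_n = xi_n + Delta M_n with |xi_n| = |rho| * |zeta_{n-1}| <= |zeta_{n-1}|.
            zeta = [rho * z + m for z, m in zip(zeta, d_m)]
            bracket += sum(m * m for m in d_m)
            running_max = max(running_max, math.sqrt(sum(z * z for z in zeta)))
        lhs_sum += running_max ** p
        rhs_sum += bracket ** (p / 2.0)
    return lhs_sum / rhs_sum


if __name__ == "__main__":
    print(simulate_ratio())
```

For \(\rho =1\) the simulated process coincides with the martingale \(M\) itself, so the estimated ratio then approximates the quantity controlled by the Burkholder–Davis–Gundy inequality; smaller \(|\rho |\) should give smaller ratios.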

Proof

Fix \(p\in [1,\infty )\).

We first consider the case where \(c=0\). Recall that by the Burkholder–Davis–Gundy inequality there exists a constant \(\bar{\kappa }>0\) depending only on p such that for every d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\),

$$\begin{aligned} {{\mathbb {E}}}\bigl [\max _{0\le k\le n}\Vert M_k\Vert ^p\bigr ]\le {\bar{\kappa }}\, {{\mathbb {E}}}\bigl [ [M]_n^{p/2}\bigr ]. \end{aligned}$$
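For orientation, in the special case \(p=2\) this inequality can be obtained directly with \({\bar{\kappa }}=4\). The following worked instance is a standard fact that is not taken from the article; it assumes that \(M\) is square-integrable and combines Doob's \(L^2\) maximal inequality for the non-negative submartingale \((\Vert M_k\Vert )_{k}\) with the orthogonality of martingale increments:

$$\begin{aligned} {{\mathbb {E}}}\bigl [\max _{0\le k\le n}\Vert M_k\Vert ^2\bigr ] \le 4\, {{\mathbb {E}}}\bigl [\Vert M_n\Vert ^2\bigr ] = 4\, {{\mathbb {E}}}\Bigl [\Vert M_0\Vert ^2+\sum _{k=1}^{n} \Vert \Delta M_k\Vert ^2\Bigr ] = 4\, {{\mathbb {E}}}\bigl [ [M]_n\bigr ]. \end{aligned}$$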

We fix a time horizon \(T\in {{\mathbb {N}}}_0\) and prove the statement of the theorem with \(\kappa = {\bar{\kappa }}\) by induction: we say that the statement holds up to time \(t\in \{0,\dots ,T\}\), if for every d-dimensional adapted process \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\), for every d-dimensional previsible process \((\xi _n)_{n\in {{\mathbb {N}}}}\) and for every d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\) with

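$$\begin{aligned} M_0=\zeta _0 \quad \text { and }\quad {\left\{ \begin{array}{ll} \zeta _n=\xi _n+\Delta M_n \ \text { with } \ \Vert \xi _n\Vert \le \Vert \zeta _{n-1}\Vert ,&{}\quad \text { if }1\le n\le t,\\ \zeta _n=\zeta _{n-1}+\Delta M_n, &{}\quad \text { if }t<n\le T,\end{array}\right. } \end{aligned} \tag{$C_t$}$$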

one has

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert \zeta _n\Vert ^p\right] \le {\bar{\kappa }}\, {{\mathbb {E}}}\bigl [ [M]_T^{p/2}\bigr ]. \end{aligned}$$

Clearly, the statement is satisfied up to time 0 as a consequence of the Burkholder–Davis–Gundy inequality. Next, suppose that the statement is satisfied up to time \(t\in \{0,\dots ,T-1\}\). Let \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) be a d-dimensional adapted process, \((\xi _n)_{n\in {{\mathbb {N}}}}\) be a d-dimensional previsible process and \((M_n)_{n\in {{\mathbb {N}}}_0}\) be a d-dimensional martingale satisfying property (\(C_{t+1}\)). Consider any \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation U on \(({{\mathbb {R}}}^d,\Vert \cdot \Vert )\) and put

$$\begin{aligned} \zeta ^U_n={\left\{ \begin{array}{ll} \zeta _n,&{}\quad \text { if }n\le t,\\ \zeta _{t}+U(M_{n}-M_{t}), &{}\quad \text { if }n>t\end{array}\right. } \end{aligned}$$

as well as

$$\begin{aligned} M^U_n={\left\{ \begin{array}{ll} M_n,&{}\quad \text { if }n\le t,\\ M_{t}+ U(M_{n}-M_{t}), &{}\quad \text { if }n>t.\end{array}\right. } \end{aligned}$$

Then it is easy to check that \((M^U_n)_{n\in {{\mathbb {N}}}_0}\) is a martingale with \([M^U]_n = [M]_n\) for all \(n\in {{\mathbb {N}}}\). Furthermore, \((\zeta ^U_n)_{n\in {{\mathbb {N}}}_0}\) is adapted and the triple \((\zeta ^U,\xi , M^U)\) satisfies property (\(C_t\)). Hence, by the induction hypothesis,

$$\begin{aligned} {{\mathbb {E}}}\bigl [\max _{0\le n\le T}\Vert \zeta ^U_n\Vert ^p\bigr ] \le {\bar{\kappa }}\, {{\mathbb {E}}}\bigl [ [M^U]_T^{p/2}\bigr ] ={\bar{\kappa }}\, {{\mathbb {E}}}\bigl [ [M]_T^{p/2}\bigr ]. \end{aligned} \tag{107}$$
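The two properties of \(M^U\) used for (107) can be checked as follows. This is a sketch of the omitted verification and relies only on the facts that \(U\) is \({\mathcal {F}}_t\)-measurable and norm-preserving. For \(n>t\) the increments of \(M^U\) are \(\Delta M^U_n = U\Delta M_n\), and since \({\mathcal {F}}_t\subset {\mathcal {F}}_{n-1}\),

$$\begin{aligned} {{\mathbb {E}}}[\Delta M^U_n|{\mathcal {F}}_{n-1}]&= U\, {{\mathbb {E}}}[\Delta M_n|{\mathcal {F}}_{n-1}] = 0,\\ \Vert \Delta M^U_n\Vert&= \Vert U\Delta M_n\Vert = \Vert \Delta M_n\Vert , \end{aligned}$$

while for \(n\le t\) the increments of \(M^U\) and \(M\) coincide. The first line gives the martingale property, and the second line shows that the summands of \([M^U]_n\) and \([M]_n\) agree term by term.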

Note that for any such random orthonormal transformation U, the norm of the random variable \(\zeta _n^U\) is the same as the norm of the variable \({\bar{\zeta }}_n^U\) given by

$$\begin{aligned} {\bar{\zeta }}^U_n={\left\{ \begin{array}{ll} \zeta _n,&{}\quad \text { if }n\le t,\\ U^* \zeta _{t}+ M_{n}-M_{t}, &{}\quad \text { if }n>t,\end{array}\right. } \end{aligned}$$

whence

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert {\bar{\zeta }}^U_n\Vert ^p\right] = {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert \zeta ^U_n\Vert ^p\right] . \end{aligned} \tag{108}$$

Clearly, we can choose an \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation U on \(({{\mathbb {R}}}^d,\Vert \cdot \Vert )\) such that

$$\begin{aligned} U^* \zeta _t = \frac{\Vert \zeta _t\Vert }{\Vert \xi _{t+1}\Vert } \xi _{t+1} \end{aligned}$$

holds on \(\{\xi _{t+1}\ne 0\}\). Let

$$\begin{aligned} \alpha = \frac{\Vert \xi _{t+1}\Vert +\Vert \zeta _t\Vert }{2\Vert \zeta _t\Vert }\cdot 1_{\{\zeta _t\ne 0\}}. \end{aligned}$$

Then \(\alpha \) is \({\mathcal {F}}_{t}\)-measurable and takes values in [0, 1] since \(\Vert \xi _{t+1}\Vert \le \Vert \zeta _t\Vert \). Moreover, we have \(\xi _{t+1} = \alpha U^* \zeta _t + (1-\alpha ) (-U)^* \zeta _t\) so that by property (\(C_{t+1}\)) of the triple \((\zeta ,\xi ,M)\),

$$\begin{aligned} \zeta _n= \xi _{t+1} + M_n-M_t = \alpha {\bar{\zeta }}^U_n+(1-\alpha ) {\bar{\zeta }}_n^{-U} \end{aligned}$$

for \(n= t+1,\dots ,T\). Note that \(\zeta _n=\zeta ^U_n=\zeta _n^{-U}\) for \(n=0,\dots ,t\). By convexity of \(\Vert \cdot \Vert ^p\) we thus obtain

$$\begin{aligned} \max _{0\le n\le T}\Vert \zeta _n\Vert ^p&= \max _{0\le n\le T}\Vert \alpha {\bar{\zeta }}^U_n+ (1-\alpha ) {\bar{\zeta }}^{-U}_n\Vert ^p \\&\le \alpha \max _{0\le n\le T}\Vert {\bar{\zeta }}^U_n\Vert ^p + (1-\alpha )\max _{0\le n\le T}\Vert {\bar{\zeta }}^{-U}_n\Vert ^p. \end{aligned}$$

Hence

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert \zeta _{n}\Vert ^p|{\mathcal {F}}_{t}\right]&\le \alpha {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert {\bar{\zeta }}^U_n\Vert ^p|{\mathcal {F}}_{t}\right] + (1-\alpha ){{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert {\bar{\zeta }}^{-U}_n\Vert ^p|{\mathcal {F}}_{t}\right] \\&\le {{\mathbb {E}}}\left[ \max _{0\le n\le T}\Vert {\bar{\zeta }}^{U'}_n\Vert ^p|{\mathcal {F}}_{t}\right] , \end{aligned}$$

where \(U'\) is the \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation given by

$$\begin{aligned} U'(\omega )= {\left\{ \begin{array}{ll} U(\omega )&{}\quad \text { if }\omega \in \bigl \{ {{\mathbb {E}}}[\max _{0\le n\le T}\Vert {\bar{\zeta }}^U_n\Vert ^p|{\mathcal {F}}_{t}]\ge {{\mathbb {E}}}[\max _{0\le n\le T}\Vert {\bar{\zeta }}^{- U}_n\Vert ^p|{\mathcal {F}}_{t}]\bigr \},\\ -U(\omega ) &{}\quad \text { otherwise}. \end{array}\right. } \end{aligned}$$

Applying (107) and (108) with \(U=U'\) finishes the induction step.
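The choice of \(U\) and \(\alpha \) in the induction step can be checked numerically. The sketch below is an illustration under stated assumptions and not the authors' construction: it takes \(U^*\) to be a Householder reflection (one possible orthonormal transformation with the required property), works with deterministic vectors in \({{\mathbb {R}}}^3\) with the Euclidean norm, and assumes \(\zeta _t\ne 0\), \(\xi _{t+1}\ne 0\) and \(\Vert \xi _{t+1}\Vert \le \Vert \zeta _t\Vert \). All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def reflect(x, v):
    """Apply the Householder reflection I - 2 v v^T / (v^T v) to x (identity if v = 0)."""
    vv = dot(v, v)
    if vv == 0.0:
        return list(x)
    scale = 2.0 * dot(v, x) / vv
    return [a - scale * b for a, b in zip(x, v)]


def decompose(zeta, xi):
    """Return (alpha, U* zeta) such that xi = alpha * U* zeta + (1 - alpha) * (-U* zeta).

    Assumes zeta != 0, xi != 0 and |xi| <= |zeta|.
    """
    # Required image of zeta: the vector in direction xi with the norm of zeta.
    target = [norm(zeta) / norm(xi) * c for c in xi]
    # Reflecting along zeta - target maps zeta to target because both have equal norm.
    axis = [a - b for a, b in zip(zeta, target)]
    u_star_zeta = reflect(zeta, axis)
    alpha = (norm(xi) + norm(zeta)) / (2.0 * norm(zeta))
    return alpha, u_star_zeta


rng = random.Random(1)
zeta_t = [rng.gauss(0.0, 1.0) for _ in range(3)]
xi_next = [0.3 * z + 0.1 * rng.gauss(0.0, 1.0) for z in zeta_t]
shrink_factor = min(1.0, norm(zeta_t) / norm(xi_next))  # enforce |xi| <= |zeta|
xi_next = [shrink_factor * c for c in xi_next]

alpha, u_star_zeta = decompose(zeta_t, xi_next)
# Convex combination of U* zeta_t and (-U)* zeta_t; it should reproduce xi_{t+1}.
recombined = [alpha * a + (1.0 - alpha) * (-a) for a in u_star_zeta]
print(alpha, recombined, xi_next)
```

The recombined vector equals \((2\alpha -1)\,U^*\zeta _t\), and \(2\alpha -1=\Vert \xi _{t+1}\Vert /\Vert \zeta _t\Vert \), which is the algebra behind the identity \(\xi _{t+1} = \alpha U^* \zeta _t + (1-\alpha ) (-U)^* \zeta _t\).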

Next, we consider the case of \(c > 0\). Suppose that \(\zeta ,\xi \) and M are as stated in the theorem. For \(n\in {{\mathbb {N}}}\) we put

$$\begin{aligned} {\tilde{\xi }}_n = (1-c/\Vert \xi _n\Vert )_+ \cdot \xi _n \end{aligned}$$

and

$$\begin{aligned} {\tilde{\zeta }}_n = {\tilde{\xi }}_n + \Delta M_n. \end{aligned}$$

Furthermore, let \({\tilde{\zeta }}_0=\zeta _0=M_0\). We will show that the triple \(({\tilde{\zeta }},{\tilde{\xi }},M)\) satisfies (106) with \(c=0\). Clearly, \(({\tilde{\zeta }}_n)_{n\in {{\mathbb {N}}}_0}\) is adapted and \((\tilde{\xi }_n)_{n\in {{\mathbb {N}}}}\) is previsible. Moreover, one has for \(n\in {{\mathbb {N}}}\) on \(\{\Vert \xi _n\Vert \ge c\}\) that

$$\begin{aligned} \Vert {\tilde{\xi }}_n\Vert&= \Vert \xi _n\Vert -c\le \Vert \zeta _{n-1}\Vert -c=\Vert \tilde{\zeta }_{n-1}+\xi _{n-1}-{\tilde{\xi }}_{n-1}\Vert -c\\&\le \Vert {\tilde{\zeta }}_{n-1}\Vert +\Vert \xi _{n-1}-{\tilde{\xi }}_{n-1}\Vert -c \le \Vert {\tilde{\zeta }}_{n-1}\Vert \end{aligned}$$

and on \(\{\Vert \xi _n\Vert < c\}\) that \(\Vert {\tilde{\xi }}_n\Vert =0\le \Vert \tilde{\zeta }_{n-1}\Vert \). We may thus apply Theorem 5.1 with \(c=0\) to obtain that for every \(n\in {{\mathbb {N}}}\),

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le k \le n}\Vert {\tilde{\zeta }}_k\Vert ^p\right] \le \bar{\kappa }\, {{\mathbb {E}}}\bigl [ [ M]_n^{p/2}\bigr ]. \end{aligned}$$
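The truncated proposal \({\tilde{\xi }}_n\) used in this reduction is a radial shrinkage of \(\xi _n\) by \(c\). The following sketch, with the hypothetical helper `shrink` and the Euclidean norm as an assumed choice of \(\Vert \cdot \Vert \), illustrates the two properties the argument relies on: \(\Vert {\tilde{\xi }}_n\Vert =(\Vert \xi _n\Vert -c)_+\) and \(\Vert \xi _n-{\tilde{\xi }}_n\Vert =\min (\Vert \xi _n\Vert ,c)\le c\).

```python
import math


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def shrink(xi, c):
    """Return (1 - c/|xi|)_+ * xi, using the convention that the result is 0 when |xi| <= c."""
    r = norm(xi)
    if r <= c:
        return [0.0 for _ in xi]
    factor = 1.0 - c / r
    return [factor * x for x in xi]


c = 1.0
for xi in ([3.0, 4.0], [0.3, 0.4], [0.0, 0.0]):
    xi_tilde = shrink(xi, c)
    # Distance between the proposal and its shrunken version; never larger than c.
    gap = norm([a - b for a, b in zip(xi, xi_tilde)])
    print(xi, xi_tilde, gap)
```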

Since for every \(n\in {{\mathbb {N}}}\),

$$\begin{aligned} \Vert \zeta _n\Vert ^p = \Vert {\tilde{\zeta }}_n + \xi _n-{\tilde{\xi }}_n\Vert ^p \le 2^p(\Vert {\tilde{\zeta }}_n\Vert ^p + c^p), \end{aligned}$$

we conclude that

$$\begin{aligned} {{\mathbb {E}}}\left[ \max _{0\le k \le n}\Vert \zeta _k\Vert ^p\right] \le 2^p\bigl (\bar{\kappa }\, {{\mathbb {E}}}\bigl [ [ M]_n^{p/2}\bigr ] + c^p\bigr ) \le 2^p({\bar{\kappa }} \vee 1) \cdot \bigl ( {{\mathbb {E}}}\bigl [ [ M]_n^{p/2}\bigr ] + c^p\bigr ), \end{aligned}$$

which completes the proof. \(\square \)

About this article

Cite this article

Dereich, S., Müller-Gronbach, T. General multilevel adaptations for stochastic approximation algorithms of Robbins–Monro and Polyak–Ruppert type. Numer. Math. 142, 279–328 (2019). https://doi.org/10.1007/s00211-019-01024-y

  • DOI: https://doi.org/10.1007/s00211-019-01024-y
