Abstract
In this article we present and analyse new multilevel adaptations of classical stochastic approximation algorithms for the computation of a zero of a function \(f:D \rightarrow {{\mathbb {R}}}^d\) defined on a convex domain \(D\subset {{\mathbb {R}}}^d\), which is given as a parameterised family of expectations. The analysis of the error and the computational cost of our method is based on assumptions similar to those used in Giles (Oper Res 56(3):607–617, 2008) for the computation of a single expectation. Additionally, we essentially only require that f satisfies a classical contraction property from stochastic approximation theory. Under these assumptions we establish error bounds in pth mean for our multilevel Robbins–Monro and Polyak–Ruppert schemes that decay in the computational time as fast as the classical error bounds for multilevel Monte Carlo approximations of single expectations known from Giles (Oper Res 56(3):607–617, 2008). Our approach is universal in the sense that, once multilevel implementations are available for a particular application, it is straightforward to implement the corresponding stochastic approximation algorithm.
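As a rough illustration of this universality, the following minimal Python sketch couples a plain Robbins–Monro iteration with a telescoping multilevel Monte Carlo estimator. Everything in it is a toy assumption rather than the scheme analysed in the paper: the target function \(f(\theta )={{\mathbb {E}}}[\theta -X]\) with \(X\sim N(1,1)\), the artificial level bias \(2^{-l}\) in `F`, the per-level sample sizes and the step sequence \(1/(n+1)\) are all hypothetical choices made for the demonstration.

```python
import random

# Toy multilevel Robbins-Monro sketch (illustrative only, not the paper's
# exact scheme).  We seek the zero of f(theta) = E[theta - X] with
# X ~ N(1, 1), so the exact zero is theta* = 1.  The hypothetical level-l
# approximation F carries an artificial bias 2^{-l} that vanishes as the
# level grows, mimicking the multilevel setting of Giles (2008).
def F(level, theta, x):
    return theta - x + 2.0 ** (-level)

def mlmc_estimate(theta, L, rng):
    """Telescoping MLMC estimate of f_L(theta) = E[F(L, theta, X)]."""
    # Level 0: plain Monte Carlo average.
    n0 = 2 ** L
    est = sum(F(0, theta, rng.gauss(1.0, 1.0)) for _ in range(n0)) / n0
    # Levels 1..L: coupled corrections E[F_l - F_{l-1}], same sample x
    # on both levels, with geometrically decreasing sample sizes.
    for l in range(1, L + 1):
        nl = max(1, 2 ** (L - l))
        s = 0.0
        for _ in range(nl):
            x = rng.gauss(1.0, 1.0)
            s += F(l, theta, x) - F(l - 1, theta, x)
        est += s / nl
    return est

rng = random.Random(42)
L, theta = 5, 0.0
for n in range(2000):
    # Robbins-Monro step with step size 1/(n+1).
    theta -= mlmc_estimate(theta, L, rng) / (n + 1)
```

With this bias model the iteration tracks the zero of the level-L approximation, \(\theta ^*_L = 1-2^{-L}\); the coupling makes the correction terms at levels \(l\ge 1\) deterministic here, so essentially all sampling effort sits at level 0.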
References
Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics, vol. 22. Springer, Berlin (1990)
Duflo, M.: Algorithmes Stochastiques. Mathématiques & Applications, vol. 23. Springer, Berlin (1996)
Frikha, N.: Multi-level stochastic approximation algorithms. Ann. Appl. Probab. 26, 933–985 (2016)
Gaposhkin, V.F., Krasulina, T.P.: On the law of the iterated logarithm in stochastic approximation processes. Theory Probab. Appl. 19(4), 844–850 (1974)
Giles, M.B.: Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)
Heinrich, S.: Multilevel Monte Carlo methods. In: Margenov, S., Waśniewski, J., Yalamov, P. (eds.) Large-Scale Scientific Computing, pp. 58–67. Springer, Berlin (2001)
Kushner, H.J., Yang, J.: Stochastic approximation with averaging of the iterates: optimal asymptotic rate of convergence for general processes. SIAM J. Control Optim. 31(4), 1045–1062 (1993)
Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability, vol. 35, 2nd edn. Springer, New York (2003)
Lai, T.L.: Stochastic approximation. Ann. Stat. 31(2), 391–406 (2003). Dedicated to the memory of Herbert E. Robbins
Lai, T.L., Robbins, H.: Limit theorems for weighted sums and stochastic approximation processes. Proc. Nat. Acad. Sci. U.S.A. 75, 1068–1070 (1978)
Le Breton, A., Novikov, A.: Some results about averaging in stochastic approximation. Metrika 42(3–4), 153–171 (1995). Second International Conference on Mathematical Statistics (Smolenice Castle, 1994)
Ljung, L., Pflug, G., Walk, H.: Stochastic Approximation and Optimization of Random Systems. DMV Seminar, vol. 17. Birkhäuser, Basel (1992)
Nualart, D.: The Malliavin Calculus and Related Topics. Probability and Its Applications (New York), 2nd edn. Springer, Berlin (2006)
Pelletier, M.: On the almost sure asymptotic behaviour of stochastic algorithms. Stoch. Process. Appl. 78(2), 217–244 (1998)
Pelletier, M.: Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Ann. Appl. Probab. 8(1), 10–44 (1998)
Polyak, B.T.: A new method of stochastic approximation type. Avtomat. i Telemekh. (7), 98–107 (1990)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Ruppert, D.: Almost sure approximations to the Robbins–Monro and Kiefer–Wolfowitz processes with dependent noise. Ann. Probab. 10, 178–187 (1982)
Ruppert, D.: Stochastic approximation. In: Ghosh, B.K., Sen, P.K. (eds.) Handbook of Sequential Analysis. Statist. Textbooks Monogr., vol. 118, pp. 503–529. Dekker, New York (1991)
Acknowledgements
We thank two anonymous referees for their valuable comments, which improved the presentation of the material.
Appendix
Let \((\Omega ,{\mathcal {F}},{P})\) be a probability space endowed with a filtration \(({\mathcal {F}}_n)_{n\in {{\mathbb {N}}}_0}\) and let \(\Vert \cdot \Vert \) denote a Hilbert space norm on \({{\mathbb {R}}}^d\).
In this section we provide pth mean estimates for an adapted d-dimensional dynamical system \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) with the property that for each \(n\in {{\mathbb {N}}}\), \(\zeta _n\) is a zero-mean perturbation of a previsible proposal \(\xi _n\) that is comparable in size to \(\zeta _{n-1}\). More formally, we assume that there exist a previsible d-dimensional process \((\xi _n)_{n\in {{\mathbb {N}}}}\), a d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\) with \(M_0=\zeta _0\) and a constant \(c\ge 0\) such that for all \(n\in {{\mathbb {N}}}\)
$$\begin{aligned} \zeta _n = \xi _n + \Delta M_n \quad \text {and}\quad \Vert \xi _n\Vert \le \Vert \zeta _{n-1}\Vert + c, \end{aligned}$$
(106)
where \(\Delta M_n = M_n-M_{n-1}\). Note that necessarily \(\xi _n={{\mathbb {E}}}[\zeta _n|{\mathcal {F}}_{n-1}]\), since \(\xi _n\) is \({\mathcal {F}}_{n-1}\)-measurable and \({{\mathbb {E}}}[\Delta M_n|{\mathcal {F}}_{n-1}]=0\).
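The identity \(\xi _n={{\mathbb {E}}}[\zeta _n|{\mathcal {F}}_{n-1}]\) can be seen in a small numerical sketch with assumed toy dynamics (the contraction factor 0.5 and the Gaussian martingale increments below are hypothetical choices): averaging many fresh one-step continuations of \(\zeta _n\) given the realised history recovers the previsible proposal \(\xi _n\).

```python
import random

# Toy one-dimensional instance of the setup: xi_n = 0.5 * zeta_{n-1}
# (so ||xi_n|| <= ||zeta_{n-1}||, i.e. c = 0) and i.i.d. mean-zero
# Gaussian martingale increments dM_n.  Conditionally on F_{n-1},
# the average of fresh continuations zeta_n = xi_n + dM_n should
# recover xi_n, illustrating xi_n = E[zeta_n | F_{n-1}].
rng = random.Random(7)
zeta_prev = 3.0                    # realised value of zeta_{n-1}
xi_next = 0.5 * zeta_prev          # previsible proposal xi_n
samples = [xi_next + rng.gauss(0.0, 1.0) for _ in range(200000)]
cond_mean = sum(samples) / len(samples)
```

Here `cond_mean` approximates \({{\mathbb {E}}}[\zeta _n|{\mathcal {F}}_{n-1}]\) and matches `xi_next` up to Monte Carlo error.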
Theorem 5.1
Assume that \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) is an adapted d-dimensional process, which satisfies (106), and let \(p\in [1,\infty )\). Then there exists a constant \(\kappa \in (0,\infty )\), which only depends on p, such that for every \(n\in {{\mathbb {N}}}_0\),
where
Proof
Fix \(p\in [1,\infty )\).
We first consider the case where \(c=0\). Recall that by the Burkholder–Davis–Gundy inequality there exists a constant \(\bar{\kappa }>0\) depending only on p such that for every d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\) and every \(n\in {{\mathbb {N}}}_0\),
$$\begin{aligned} {{\mathbb {E}}}\Bigl [\max _{0\le k\le n}\Vert M_k\Vert ^p\Bigr ] \le \bar{\kappa }^p\, {{\mathbb {E}}}\bigl [[M]_n^{p/2}\bigr ], \end{aligned}$$
where \([M]_n\) denotes the quadratic variation of M up to time n.
We fix a time horizon \(T\in {{\mathbb {N}}}_0\) and prove the statement of the theorem with \(\kappa = {\bar{\kappa }}\) by induction: we say that the statement holds up to time \(t\in \{0,\dots ,T\}\), if for every d-dimensional adapted process \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\), for every d-dimensional previsible process \((\xi _n)_{n\in {{\mathbb {N}}}}\) and for every d-dimensional martingale \((M_n)_{n\in {{\mathbb {N}}}_0}\) with
one has
Clearly, the statement is satisfied up to time 0 as a consequence of the Burkholder–Davis–Gundy inequality. Next, suppose that the statement is satisfied up to time \(t\in \{0,\dots ,T-1\}\). Let \((\zeta _n)_{n\in {{\mathbb {N}}}_0}\) be a d-dimensional adapted process, \((\xi _n)_{n\in {{\mathbb {N}}}}\) be a d-dimensional previsible process and \((M_n)_{n\in {{\mathbb {N}}}_0}\) be a d-dimensional martingale satisfying property (\(C_{t+1}\)). Consider any \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation U on \(({{\mathbb {R}}}^d,\Vert \cdot \Vert )\) and put
as well as
Then it is easy to check that \((M^U_n)_{n\in {{\mathbb {N}}}_0}\) is a martingale with \([M^U]_n = [M]_n\) for all \(n\in {{\mathbb {N}}}\). Furthermore, \((\zeta ^U_n)_{n\in {{\mathbb {N}}}_0}\) is adapted and the triple \((\zeta ^U,\xi , M^U)\) satisfies property (\(C_t\)). Hence, by the induction hypothesis,
Note that for any such random orthonormal transformation U, the norm of the random variable \(\zeta _n^U\) is the same as the norm of the variable \({\bar{\zeta }}_n^U\) given by
whence
Clearly, we can choose an \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation U on \(({{\mathbb {R}}}^d,\Vert \cdot \Vert )\) such that
holds on \(\{\xi _{t+1}\ne 0\}\). Let
Then \(\alpha \) is \({\mathcal {F}}_{t}\)-measurable and takes values in [0, 1] since \(\Vert \xi _{t+1}\Vert \le \Vert \zeta _t\Vert \). Moreover, we have \(\xi _{t+1} = \alpha U^* \zeta _t + (1-\alpha ) (-U)^* \zeta _t\) so that by property (\(C_{t+1}\)) of the triple \((\zeta ,\xi ,M)\),
for \(n= t+1,\dots ,T\). Note that \(\zeta _n=\zeta ^U_n=\zeta _n^{-U}\) for \(n=0,\dots ,t\). By convexity of \(\Vert \cdot \Vert ^p\) we thus obtain
Hence
where \(U'\) is the \({\mathcal {F}}_{t}\)-measurable random orthonormal transformation given by
Applying (107) and (108) with \(U=U'\) finishes the induction step.
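The orthonormal transformation U and the weight \(\alpha \) used in this step can be made concrete in a small numerical sketch. The vectors \(\zeta _t\) and \(\xi _{t+1}\) below are hypothetical, and the Householder reflection is just one convenient choice of an orthonormal map with \(U^*\zeta _t\) positively proportional to \(\xi _{t+1}\); with that choice, \(\alpha = (1+\Vert \xi _{t+1}\Vert /\Vert \zeta _t\Vert )/2\in [0,1]\) gives \(\xi _{t+1}=\alpha U^*\zeta _t+(1-\alpha )(-U)^*\zeta _t\).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def householder_apply(a_unit, b_unit, x):
    # Reflection H with H a_unit = b_unit (valid whenever a_unit != b_unit):
    # H x = x - 2 v (v.x) / (v.v)  with  v = a_unit - b_unit.
    v = [a - b for a, b in zip(a_unit, b_unit)]
    coef = 2.0 * dot(v, x) / dot(v, v)
    return [xi - coef * vi for xi, vi in zip(x, v)]

# Hypothetical data with ||xi|| <= ||zeta||, as required in the proof.
zeta = [1.0, 2.0, -0.5]    # plays the role of zeta_t
xi = [0.3, -0.4, 0.2]      # plays the role of xi_{t+1}

a = [z / norm(zeta) for z in zeta]      # unit vector along zeta_t
b = [x / norm(xi) for x in xi]          # unit vector along xi_{t+1}
Uz = householder_apply(a, b, zeta)      # U* zeta_t: same norm, parallel to xi
alpha = (1.0 + norm(xi) / norm(zeta)) / 2.0
combo = [alpha * u + (1.0 - alpha) * (-u) for u in Uz]  # = (2a-1) U* zeta_t
```

The sketch confirms the two properties the proof relies on: the transformation preserves the norm of \(\zeta _t\), and the convex combination with weights \(\alpha \) and \(1-\alpha \) reproduces \(\xi _{t+1}\) exactly.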
Next, we consider the case of \(c > 0\). Suppose that \(\zeta ,\xi \) and M are as stated in the theorem. For \(n\in {{\mathbb {N}}}\) we put
and
Furthermore, let \({\tilde{\zeta }}_0=\zeta _0=M_0\). We will show that the triple \(({\tilde{\zeta }},{\tilde{\xi }},M)\) satisfies (106) with \(c=0\). Clearly, \(({\tilde{\zeta }}_n)_{n\in {{\mathbb {N}}}_0}\) is adapted and \((\tilde{\xi }_n)_{n\in {{\mathbb {N}}}}\) is previsible. Moreover, one has for \(n\in {{\mathbb {N}}}\) on \(\{\Vert \xi _n\Vert \ge c\}\) that
and on \(\{\Vert \xi _n\Vert < c\}\) that \(\Vert {\tilde{\xi }}_n\Vert =0\le \Vert \tilde{\zeta }_{n-1}\Vert \). We may thus apply Theorem 5.1 with \(c=0\) to obtain that for every \(n\in {{\mathbb {N}}}\),
Since for every \(n\in {{\mathbb {N}}}\),
we conclude that
which completes the proof. \(\square \)
Dereich, S., Müller-Gronbach, T. General multilevel adaptations for stochastic approximation algorithms of Robbins–Monro and Polyak–Ruppert type. Numer. Math. 142, 279–328 (2019). https://doi.org/10.1007/s00211-019-01024-y