Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

Abstract

We prove under commonly used assumptions the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory.

Previous convergence proofs assume linear function approximation, cannot treat episodic examples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.

References

  1. Absil, P.A., Kurdyka, K.: On the stable equilibrium points of gradient systems. Syst. Control Lett. 55(7), 573–577 (2006)

  2. Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: Return decomposition for delayed rewards (2018). ArXiv https://arxiv.org/abs/1806.07857

  3. Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: return decomposition for delayed rewards. In: Advances in Neural Information Processing Systems, vol. 33 (2019). ArXiv https://arxiv.org/abs/1806.07857

  4. Bakker, B.: Reinforcement learning by backpropagation through an LSTM model/critic. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 127–134 (2007). https://doi.org/10.1109/ADPRL.2007.368179

  5. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)

  6. Bhatnagar, S., Prasad, H.L., Prashanth, L.A.: Stochastic Recursive Algorithms for Optimization. Lecture Notes in Control and Information Sciences, 1st edn., p. 302. Springer, London (2013). https://doi.org/10.1007/978-1-4471-4285-0

  7. Borkar, V.S.: Stochastic Approximation. TRM, vol. 48. Hindustan Book Agency, Gurgaon (2008). https://doi.org/10.1007/978-93-86279-38-5

  8. Borkar, V.S., Meyn, S.P.: The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2), 447–469 (2000). https://doi.org/10.1137/S0363012997331639

  9. Casella, G., Berger, R.L.: Statistical Inference. Wadsworth and Brooks/Cole, Stanley (2002)

  10. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 192–204 (2015)

  11. Dayan, P.: The convergence of TD(\(\lambda \)) for general \(\lambda \). Mach. Learn. 8, 341 (1992)

  12. Fan, J., Wang, Z., Xie, Y., Yang, Z.: A theoretical analysis of deep \(q\)-learning. CoRR abs/1901.00137 (2020)

  13. Hairer, M.: Ergodic properties of Markov processes. In: Lecture Notes (2018)

  14. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. pp. 6626–6637. Curran Associates, Inc. (2017). Preprint arXiv:1706.08500

  15. Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv:1902.00618 (2019)

  16. Karmakar, P., Bhatnagar, S.: Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res. (2017). https://doi.org/10.1287/moor.2017.0855

  17. Kawaguchi, K.: Deep learning without poor local minima. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. pp. 586–594 (2016)

  18. Kawaguchi, K., Bengio, Y.: Depth with nonlinearity creates no bad local minima in ResNets. Neural Netw. 118, 167–174 (2019)

  19. Kawaguchi, K., Huang, J., Kaelbling, L.P.: Effect of depth and width on local minima in deep learning. Neural Comput. 31(6), 1462–1498 (2019)

  20. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv:1710.05468 (2017)

  21. Konda, V.R., Borkar, V.S.: Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Control Optim. 38(1), 94–123 (1999). https://doi.org/10.1137/S036301299731669X

  22. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  23. Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM J. Control Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S0363012901385691

  24. Kushner, H.J., Clark, D.S.: Stochastic Approximation Methods for Constrained and Unconstrained Systems. Applied Mathematical Sciences. Springer, New York (1978). https://doi.org/10.1007/978-1-4684-9352-8

  25. Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer, New York (2003). https://doi.org/10.1007/b97441

  26. Lin, T., Jin, C., Jordan, M.I.: On gradient descent ascent for nonconvex-concave minimax problems. arXiv:1906.00331 (2019)

  27. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. In: Advances in Neural Information Processing Systems, vol. 33. arXiv:1906.10306 (2019)

  28. Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22. pp. 1204–1212. Curran Associates, Inc. (2009)

  29. Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv:1901.00838 (2019)

  30. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. In: Advances in Neural Information Processing Systems, vol. 34 (2020). arXiv:2006.11144

  31. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv:1312.5602 (2013)

  32. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236

  33. Munro, P.W.: A dual back-propagation scheme for scalar reinforcement learning. In: Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pp. 165–176 (1987)

  34. OpenAI, et al.: Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680 (2019)

  35. Patil, V.P., et al.: Align-RUDDER: learning from few demonstrations by reward redistribution. arXiv:2009.14108 (2020)

  36. Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)

  37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586

  38. Robinson, A.J.: Dynamic error propagation networks. Ph.D. thesis, Trinity Hall and Cambridge University Engineering Department (1989)

  39. Robinson, T., Fallside, F.: Dynamic reinforcement driven error propagation networks with application to game playing. In: Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pp. 836–843 (1989)

  40. Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P.: Trust region policy optimization. arXiv:1502.05477 (2015). 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 37

  41. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2018)

  42. Singh, S., Jaakkola, T., Littman, M., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38, 287–308 (2000). https://doi.org/10.1023/A:1007678930559

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  44. Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)

  45. Tsitsiklis, J.N.: Asynchronous stochastic approximation and \(q\)-learning. Mach. Learn. 16(3), 185–202 (1994). https://doi.org/10.1023/A:1022689125041

  46. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019). https://doi.org/10.1038/s41586-019-1724-z

  47. Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992)

  48. Xu, T., Zou, S., Liang, Y.: Two time-scale off-policy TD learning: non-asymptotic analysis over Markovian samples. Adv. Neural Inf. Process. Syst. 32, 10633–10643 (2019)

  49. Yang, Z., Chen, Y., Hong, M., Wang, Z.: Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. Adv. Neural Inf. Process. Syst. 32, 8351–8363 (2019)

Acknowledgments

The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State Upper Austria. IARAI is supported by Here Technologies. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), Medical Cognitive Computing Center (MC3), PRIMAL (FFG873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3 ID: 951847), AIDD (MSCA-ITN-2020 ID: 956832). We thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google Brain, ZF Friedrichshafen AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA Corporation.

A Appendix

This appendix provides the reader with details and more precise descriptions of several parts of the main text, e.g. exact formulations of the algorithms and more technical proof steps. Sections A.1 and A.2 provide the full formulations of the PPO and RUDDER algorithms, respectively, for which we ensure convergence. Section A.3 describes how the causality assumption leads to the formulas for PPO. In Sect. A.4 we discuss the precise formulations of the assumptions from [16]. Section A.5 gives further details about the probabilistic setup that we use to formalize the sampling process, while Sect. A.6 gives formal details on how to verify the assumptions from [16] and obtain our main convergence result, Theorem 1. The last Sect. A.7 discusses how the optimal policy can be deduced from the approximate ones.

1.1 A.1 Further Details on PPO

Here we describe the minimization problem for the PPO setup in more detail by including the exact expressions for the gradients of the respective loss functions:

$$\begin{aligned}&\mathrm {L}_h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) = \ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n) } \left[ - \ G_0 \ + \ (z_2)_n \ \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] , \end{aligned}$$
(15)
$$\begin{aligned}&h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\sum _{t=0}^T \nabla _{\boldsymbol{\theta }} \log \pi (a_t \mid s_t ; \boldsymbol{\theta }_n,\boldsymbol{z}_n) \ ( \hat{q}^{\pi }(s_{t},a_{t};\boldsymbol{\omega }_n) - \hat{v}^\pi (s_t;\boldsymbol{\omega }_n) ) \right. \nonumber \\&+ (z_2)_n \ \sum _{t=0}^T \nabla _{\boldsymbol{\theta }_n} \log \pi (a_t \mid s_t ; \boldsymbol{\theta }_n,\boldsymbol{z}_n) \ \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) + \ (z_2)_n \nabla _{\boldsymbol{\theta }_n} \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \Bigg ] ,\end{aligned}$$
(16)
$$\begin{aligned}&\mathrm {L}^\mathrm {TD}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \big ( \delta ^{\mathrm {TD}}(t) \big )^2 \right] , \end{aligned}$$
(17)
$$\begin{aligned}&f^\mathrm {TD}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ - \sum _{t=0}^{T} \delta ^{\mathrm {TD}}(t) \ \nabla _{\boldsymbol{\omega }_n} \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \right] , \end{aligned}$$
(18)
$$\begin{aligned}&\mathrm {L}^\mathrm {MC}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \bigg ( G_t \ - \ \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \bigg )^2 \right] , \end{aligned}$$
(19)
$$\begin{aligned}&f^\mathrm {MC}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \ \nonumber \\&\ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\sum _{t=0}^{T} \bigg ( G_t \ - \ \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \bigg ) \ \nabla _{\boldsymbol{\omega }_n} \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \right] , \end{aligned}$$
(20)
$$\begin{aligned}&\boldsymbol{\theta }_{n+1} \ = \ \boldsymbol{\theta }_n \ - \ a(n) \ \hat{h} (\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n), \boldsymbol{\omega }_{n+1} \ = \ \boldsymbol{\omega }_n \ - \ b(n) \ \hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n). \end{aligned}$$
(21)
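
The coupled updates in Eq. (21) can be illustrated by a short, schematic Python sketch of the two time-scale structure: the critic parameters \(\boldsymbol{\omega }\) move with the faster step size b(n), the actor parameters \(\boldsymbol{\theta }\) with the slower step size a(n) = o(b(n)), and the control variable \((z_1)_n\) (cf. Sect. A.6, Ad (A1)) slowly increases the greediness towards \(\beta \). The functions h_hat and f_hat below are placeholders, i.e. assumptions of this sketch, for the stochastic gradient estimates \(\hat{h}\) and \(\hat{f}\) computed from sampled episodes; they are not part of the paper.

```python
import numpy as np

# Placeholders (assumptions of this sketch) for the stochastic gradient estimates
# \hat{h} and \hat{f} of Eq. (21); for PPO they would be Monte Carlo estimates of
# Eqs. (16) and (18) (or (20)) computed from trajectories sampled under pi(theta, z).
def h_hat(theta, omega, z1):
    return np.zeros_like(theta)           # actor gradient estimate (placeholder)

def f_hat(theta, omega, z1):
    return np.zeros_like(omega)           # critic gradient estimate (placeholder)

def two_timescale_iteration(theta, omega, beta=10.0, n_iter=10_000):
    """Sketch of the coupled updates in Eq. (21): the critic (omega) is updated on the
    faster time scale b(n), the actor (theta) on the slower time scale a(n) = o(b(n))."""
    z1 = 1.0                               # greediness control (z_1)_n, cf. Ad (A1)
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n                      # slow step size
        b_n = 1.0 / n ** (2.0 / 3.0)       # fast step size, a(n)/b(n) -> 0
        theta = theta - a_n * h_hat(theta, omega, z1)
        omega = omega - b_n * f_hat(theta, omega, z1)
        z1 = (1.0 - 1.0 / beta) * z1 + 1.0  # converges monotonically to beta
    return theta, omega

theta, omega = two_timescale_iteration(np.zeros(4), np.zeros(6))
```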

1.2 A.2 Further Details on RUDDER

In a similar vein we present the minimization problem of RUDDER in more detail:

$$\begin{aligned}&\mathrm {L}_h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n,\boldsymbol{z}_n)=\nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \breve{\pi }} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \bigg ( R_{t+1}(\tau ; \boldsymbol{\omega }_n) - \hat{q}(s_t, a_t; \boldsymbol{\theta }_n)\bigg )^2 \ + \ (z_2)_n \ \rho _{\boldsymbol{\theta }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] \end{aligned}$$
(22)
$$\begin{aligned}&h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n,\boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \breve{\pi }} \left[ -\sum _{t=0}^{T} \bigg ( R_{t+1}(\tau ; \boldsymbol{\omega }_n) - \hat{q}(s_t, a_t; \boldsymbol{\theta }_n)\bigg ) \ \nabla _{\boldsymbol{\theta }} \hat{q}(s_t, a_t; \boldsymbol{\theta }_n) \right. \Bigg . + (z_2)_n \ \nabla _{\boldsymbol{\theta }} \rho _{\boldsymbol{\theta }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \ \Bigg ] \end{aligned}$$
(23)
$$\begin{aligned}&\mathrm {L}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \bigg ( \sum _{t=0}^{T} \tilde{R}_{t+1} \ - \ g( \tau ; \boldsymbol{\omega }_n ) \bigg )^2 \ + \ (z_2)_n \ \rho _{\boldsymbol{\omega }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] \end{aligned}$$
(24)
$$\begin{aligned}&f(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\bigg ( \sum _{t=0}^{T} \tilde{R}_{t+1} \ - \ g( \tau ; \boldsymbol{\omega }_n ) \bigg ) \ \nabla _{\boldsymbol{\omega }} g( \tau ; \boldsymbol{\omega }_n ) \ \right. \nonumber \\&\Bigg . +(z_2)_n \ \nabla _{\boldsymbol{\omega }} \rho _{\boldsymbol{\omega }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \Bigg ], \end{aligned}$$
(25)
$$\begin{aligned}&\boldsymbol{\theta }_{n+1} \ = \ \boldsymbol{\theta }_n \ - \ a(n) \ \hat{h} (\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n), \boldsymbol{\omega }_{n+1} \ = \ \boldsymbol{\omega }_n \ - \ b(n) \ \hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n). \end{aligned}$$
(26)
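
For RUDDER, the quantity that drives the update of \(\hat{q}(s_t, a_t; \boldsymbol{\theta })\) in Eqs. (22)–(23) is the redistributed reward \(R_{t+1}(\tau ; \boldsymbol{\omega })\). The following sketch only illustrates the return-decomposition idea behind this quantity, namely redistributing the predicted return as differences of consecutive predictions of the return model g, as proposed in [2, 3]; the toy featurization and the linear return model are assumptions of this sketch, not the construction used in the paper.

```python
import numpy as np

def featurize(prefix):
    # Toy featurization of a state-action-reward prefix (an assumption of this sketch).
    return np.array([len(prefix), sum(r for (_, _, r) in prefix)], dtype=float)

def return_model_g(prefix, omega):
    """Placeholder for g(tau_{0,t}; omega): a model (an LSTM in RUDDER) that predicts
    the final return of the episode from the prefix tau_{0,t}; here a linear toy model."""
    return float(np.dot(omega, featurize(prefix)))

def redistributed_rewards(trajectory, omega):
    """Redistribute the predicted return as differences of consecutive predictions,
    R_{t+1}(tau; omega) = g(tau_{0,t}; omega) - g(tau_{0,t-1}; omega), the
    return-decomposition choice of RUDDER [2, 3] entering Eq. (22)."""
    rewards, prev = [], 0.0
    for t in range(len(trajectory)):
        pred = return_model_g(trajectory[: t + 1], omega)
        rewards.append(pred - prev)       # telescoping: the rewards sum to g(tau; omega)
        prev = pred
    return rewards

# Usage with a hypothetical trajectory of (state, action, reward) triples:
tau = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]
print(redistributed_rewards(tau, omega=np.array([0.1, 1.0])))
```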

1.3 A.3 Causality and Reward-To-Go

This section provides the reader with more details concerning the causality assumption that leads to the formula for h in Eq. (16) for PPO. We derive a formulation of the policy gradient with reward-to-go. For ease of notation, instead of using \(\tilde{P_{\pi }}(\tau )\) as in previous sections, we here denote the probability of the state-action sequence \(\tau =\tau _{0,T}=(s_0,a_0,s_1,a_1,\ldots ,s_T,a_T)\) under policy \(\pi \) as

$$\begin{aligned}&p(\tau ) \ = \ p(s_0) \ \pi (a_0 \mid s_0) \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \pi (a_t \mid s_t) \nonumber \\&= \ p(s_0) \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \prod _{t=0}^{T} \pi (a_t \mid s_t). \end{aligned}$$
(27)

The probability of state-action sequence \(\tau _{0,t}=(s_0,a_0,s_1,a_1,\ldots ,s_t,a_t)\) with policy \(\pi \) is

$$\begin{aligned}&p(\tau _{0,t}) \ = \ p(s_0) \ \pi (a_0 \mid s_0) \ \prod _{k=1}^{t} p(s_k \mid s_{k-1},a_{k-1}) \ \pi (a_k \mid s_k) \nonumber \\&= \ p(s_0) \ \prod _{k=1}^{t} p(s_k \mid s_{k-1},a_{k-1}) \ \prod _{k=0}^{t} \pi (a_k \mid s_k). \end{aligned}$$
(28)

The probability of state-action sequence \(\tau _{t+1,T}=(s_{t+1},a_{t+1},\ldots ,s_T,a_T)\) with policy \(\pi \) given \(( s_t,a_t)\) is

$$\begin{aligned}&p(\tau _{t+1,T} \mid s_t,a_t) \ = \ \prod _{k=t+1}^{T} p(s_k \mid s_{k-1},a_{k-1}) \ \pi (a_k \mid s_k) \nonumber \\&= \ \prod _{k=t+1}^{T} p(s_k \mid s_{k-1},a_{k-1}) \ \prod _{k=t+1}^{T} \pi (a_k \mid s_k). \end{aligned}$$
(29)

The expectation of \(\sum _{t=0}^{T} R_{t+1}\) is

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \sum _{t=0}^{T} \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] . \end{aligned}$$
(30)

With \(R_{t+1} \sim p(r_{t+1} \mid s_t,a_t)\), the random variable \(R_{t+1}\) depends only on \((s_t,a_t)\). We define the expected reward \(\mathbf {\mathrm {E}}_{r_{t+1}} \left[ R_{t+1} \mid s_t,a_t\right] \) as a function \(r(s_t,a_t)\) of \((s_t,a_t)\):

$$\begin{aligned} r(s_t,a_t)&:= \ \mathbf {\mathrm {E}}_{r_{t+1}} \left[ R_{t+1} \mid s_t,a_t\right] \ = \ \sum _{r_{t+1}} p(r_{t+1} \mid s_t,a_t) \ r_{t+1}. \end{aligned}$$
(31)

Causality. We assume that the reward \(R_{t+1}=R(s_t,a_t) \sim p(r_{t+1} \mid s_t,a_t)\) only depends on the past but not on the future. The state-action pair \((s_t,a_t)\) is determined by the past and not by the future. What matters is only how likely we are to observe \((s_t,a_t)\), not what we do afterwards.

Causality is derived from the Markov property of the MDP and means:

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\tau _{0,t} \sim \pi } \left[ R_{t+1} \right] . \end{aligned}$$
(32)

That is

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ R_{t+1} \right] \ = \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_T} \sum _{a_T} p(\tau ) \ r(s_t,a_t) \nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_T} \sum _{a_T} \ \prod _{l=1}^{T} p(s_l \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{T} \pi (a_l \mid s_l) \ r(s_t,a_t)\nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_t} \sum _{a_t} \ \prod _{l=1}^{t} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{t} \pi (a_{l} \mid s_{l}) \ r(s_t,a_t)\nonumber \\&~~~\sum _{s_{t+1}} \sum _{a_{t+1}} \sum _{s_{t+2}} \sum _{a_{t+2}} \ \ldots \ \sum _{s_T} \sum _{a_T} \ \prod _{l=t+1}^{T} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=t+1}^{T} \pi (a_{l} \mid s_{l})\nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_t} \sum _{a_t} \ \prod _{l=1}^{t} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{t} \pi (a_{l} \mid s_{l}) \ r(s_t,a_t) \nonumber \\&= \ \mathbf {\mathrm {E}}_{\tau _{0,t} \sim \pi } \left[ R_{t+1} \right] . \end{aligned}$$
(33)

Policy Gradient Theorem. We now assume that the policy \(\pi \) is parametrized by \(\boldsymbol{\theta }\), that is, \(\pi (a_t \mid s_t) = \pi (a_t \mid s_t ; \boldsymbol{\theta })\). We need the gradient with respect to \(\boldsymbol{\theta }\) of \(\prod _{t=a}^{b} \pi (a_t \mid s_t)\):

$$\begin{aligned}&\nabla _{\theta } \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ = \ \sum _{s=a}^{b} \prod _{t=a,t \not = s}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \nabla _{\theta } \pi (a_s \mid s_s ; \boldsymbol{\theta }) \nonumber \\&= \ \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \sum _{s=a}^{b} \frac{ \nabla _{\theta } \pi (a_s \mid s_s ; \boldsymbol{\theta })}{\pi (a_s \mid s_s ; \boldsymbol{\theta })}\nonumber \\&= \ \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \sum _{s=a}^{b} \nabla _{\theta } \log \pi (a_s \mid s_s ; \boldsymbol{\theta }). \end{aligned}$$
(34)

It follows that

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{s=1}^{t} \nabla _{\theta } \log \pi (a_s \mid s_s ; \boldsymbol{\theta }) \ R_{t+1} \right] . \end{aligned}$$
(35)

We only have to consider the reward-to-go. Since \(a_0\) does not depend on \(\pi \), we have \(\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_1 \right] =0\). Therefore

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \sum _{t=0}^{T} \nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=1}^{T} \sum _{k=1}^{t} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \sum _{t=k}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ \sum _{t=k}^{T} R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] . \end{aligned}$$
(36)

We can express this in terms of Q-values:

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\tau _{0,k} \sim \pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ \mathbf {\mathrm {E}}_{\tau _{k+1,T} \sim \pi } \left[ G_k \mid s_k,a_k \right] \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\tau _{0,k} \sim \pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] . \end{aligned}$$
(37)

Finally, we have:

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] . \end{aligned}$$
(38)
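
Identity (38), obtained via Eq. (36), can be checked numerically on a toy problem. The sketch below uses a hypothetical single-state MDP with two actions, horizon T, reward 1 for the first action and 0 otherwise, and a softmax policy; all of these choices are assumptions made only for this illustration. It compares a finite-difference gradient of the expected return with a Monte Carlo estimate of \(\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \) (in this toy check all actions, including the first one, are sampled from \(\pi \)).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 3, 100_000                  # horizon and number of sampled episodes

def pi(theta):
    """Softmax policy over two actions in a single-state toy MDP."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_return(theta):
    # reward 1 for action 0 and 0 for action 1, at each of the T steps
    return T * pi(theta)[0]

def reward_to_go_estimate(theta):
    """Monte Carlo estimate of Eq. (36): E_pi[ sum_k grad log pi(a_k) * G_k ]."""
    grad, p = np.zeros_like(theta), pi(theta)
    for _ in range(N):
        actions = rng.choice(2, size=T, p=p)
        rewards = (actions == 0).astype(float)
        G = np.cumsum(rewards[::-1])[::-1]             # reward-to-go G_k
        for k in range(T):
            score = -p.copy()
            score[actions[k]] += 1.0                   # grad_theta log pi(a_k)
            grad += score * G[k]
    return grad / N

theta, eps = np.array([0.3, -0.2]), 1e-5
fd = np.zeros(2)
for i in range(2):
    e_i = np.zeros(2); e_i[i] = 1.0
    fd[i] = (expected_return(theta + eps * e_i) - expected_return(theta - eps * e_i)) / (2 * eps)
print("finite-difference gradient:", fd)
print("reward-to-go estimate:     ", reward_to_go_estimate(theta))
```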

1.4 A.4 Precise Statement of the Assumptions

Here we provide a precise formulation of the assumptions from [16]. The formulation we use here is mostly taken from [14]:

  1. (A1)

    Assumptions on the controlled Markov processes: The controlled Markov process \(\boldsymbol{z}\) takes values in a compact metric space S. It is controlled by the iterate sequences \(\{\boldsymbol{\theta }_n\}\) and \(\{\boldsymbol{\omega }_n\}\), and furthermore \(\boldsymbol{z}_n\) is controlled by a random process \(\boldsymbol{a}_n\) taking values in a compact metric space W. For B Borel in S, the dynamics of \(\boldsymbol{z}_n\) for \(n\geqslant 0\) is determined by a transition kernel \(\tilde{p}\):

    $$\begin{aligned}&\mathrm {P}(\boldsymbol{z}_{n+1} \in B |\boldsymbol{z}_l, \boldsymbol{a}_l, \boldsymbol{\theta }_l, \boldsymbol{\omega }_l, l\leqslant n) = \ \int _{B} \tilde{p}(\mathrm {d}\boldsymbol{z}| \boldsymbol{z}_n, \boldsymbol{a}_n, \boldsymbol{\theta }_n, \boldsymbol{\omega }_n). \end{aligned}$$
    (39)
  2. (A2)

    Assumptions on the update functions: \(h : \mathbb {R}^{m+k} \times S \rightarrow \mathbb {R}^m\) is jointly continuous as well as Lipschitz in its first two arguments, uniformly w.r.t. the third. This means that for all \( \boldsymbol{z}\in S\):

    $$\begin{aligned} \Vert h(\boldsymbol{\theta }, \boldsymbol{\omega }, \boldsymbol{z}) \ - \ h(\boldsymbol{\theta }', \boldsymbol{\omega }', \boldsymbol{z})\Vert \leqslant \ L^{(1)} \ (\Vert \boldsymbol{\theta }-\boldsymbol{\theta }'\Vert + \Vert \boldsymbol{\omega }- \boldsymbol{\omega }'\Vert ). \end{aligned}$$
    (40)

    Similarly for f, where the Lipschitz constant is \(L^{(2)}\).

  3. (A3)

    Assumptions on the additive noise: For \(i=1,2\), \(\{(\boldsymbol{m}_i)_n\}\) are martingale difference sequences with bounded second moments. More precisely, \((\boldsymbol{m}_i)_n\) are martingale difference sequences w.r.t. increasing \(\sigma \)-fields

    $$\begin{aligned} \mathfrak {F}_n \ = \ \sigma (\boldsymbol{\theta }_l, \boldsymbol{\omega }_l, (\boldsymbol{m}_1)_{l}, (\boldsymbol{m}_2)_{l}, \boldsymbol{z}_l, l \leqslant n) , \end{aligned}$$
    (41)

    satisfying \( \mathrm {E}\left[ \Vert (\boldsymbol{m}_i)_n \Vert ^2 \mid \mathfrak {F}_n \right] \ \leqslant \ B_i \) for \(n \geqslant 0\) and given constants \(B_i\).

  4. (A4)

    Assumptions on the learning rates:

    $$\begin{aligned}&\sum _{n} a(n) \ = \ \infty , \quad \sum _{n} a^2(n) \ < \ \infty , \end{aligned}$$
    (42)
    $$\begin{aligned}&\sum _{n} b(n) \ = \ \infty , \quad \sum _{n} b^2(n) \ < \ \infty , \end{aligned}$$
    (43)

    and \(a(n) \ = \ \mathrm {o}(b(n))\). Furthermore, \(a(n), b(n), n \geqslant 0\) are non-increasing. A concrete example of such schedules is given directly after this list.

  5. (A5)

    Assumptions on the transition kernels: The state-action map

    $$\begin{aligned} S \times W \times \mathbb {R}^{m+k} \ni&(\boldsymbol{z},\boldsymbol{a},\boldsymbol{\theta },\boldsymbol{\omega }) \mapsto \ \tilde{p}(\mathrm {d}\boldsymbol{y}\mid \boldsymbol{z}, \boldsymbol{a}, \boldsymbol{\theta }, \boldsymbol{\omega }) \end{aligned}$$
    (44)

    is continuous (the topology on the spaces of probability measures is induced by weak convergence).

  6. (A6)

    Assumptions on the associated ODEs: We consider occupation measures, which intuitively give, for the controlled Markov process, the probability or density of observing a particular state-action pair from \(S \times W\) for given \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) and a given control. A precise definition of these occupation measures can be found e.g. on page 68 of [7] or page 5 of [16]. We make the following assumptions:

    • We assume that there exists only one such ergodic occupation measure for \(\boldsymbol{z}_n\) on \(S \times W\), denoted by \(\varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}\). A main reason for assuming uniqueness is that it enables us to deal with ODEs instead of differential inclusions. Moreover, set

      $$\begin{aligned} \tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega }) \ = \ \int f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$
      (45)
    • We assume that for \( \boldsymbol{\theta }\in \mathbb {R}^m\), the ODE \( \dot{\boldsymbol{\omega }}(t) \ = \ \tilde{f}(\boldsymbol{\theta },\boldsymbol{\omega }(t)) \) has a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) with attractor set \(B_{\boldsymbol{\theta }}\) such that \(\boldsymbol{\lambda }: \mathbb {R}^m \rightarrow \mathbb {R}^k\) is a Lipschitz map with global Lipschitz constant.

    • The Lyapunov function \(V(\boldsymbol{\theta },.)\) associated to \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is continuously differentiable.

    • Next define

      $$\begin{aligned} \tilde{h}(\boldsymbol{\theta }) \ = \ \int h(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }),\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta })}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$
      (46)

      We assume that the ODE \( \dot{\boldsymbol{\theta }}(t) \ = \ \tilde{h}(\boldsymbol{\theta }(t)) \) has a global attractor set A.

    • For all \(\boldsymbol{\theta }\), with probability 1, \(\boldsymbol{\omega }_n\) for \(n\geqslant 1\) belongs to a compact subset \(Q_{\boldsymbol{\theta }}\) of \(B_{\boldsymbol{\theta }}\) “eventually”.

    This assumption is an adapted version of (A6)’ of [16], to avoid too many technicalities (e.g. in [16] two controls are used, which we avoid here to not overload notation).

  7. (A7)

    Assumption of bounded iterates: \(\sup _n \Vert \boldsymbol{\theta }_n \Vert \ < \ \infty \) and \(\sup _n \Vert \boldsymbol{\omega }_n \Vert \ < \ \infty \) a.s.
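
As a concrete illustration of (A4), announced in the corresponding item above, a standard choice of step-size schedules (an example, not one prescribed by the paper) is

$$\begin{aligned} a(n) \ = \ \frac{1}{n+1}, \qquad b(n) \ = \ \frac{1}{(n+1)^{2/3}}. \end{aligned}$$

Both schedules are non-increasing, \(\sum _{n} a(n) = \sum _{n} b(n) = \infty \), \(\sum _{n} a^2(n) < \infty \), \(\sum _{n} b^2(n) = \sum _{n} (n+1)^{-4/3} < \infty \), and \(a(n)/b(n) = (n+1)^{-1/3} \rightarrow 0\), hence \(a(n) = \mathrm {o}(b(n))\).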

1.5 A.5 Further Details Concerning the Sampling Process

Let us formulate the construction of the sampling process in more detail: We introduced the function \(S_{\pi }\) in the main paper as follows:

$$\begin{aligned} S_{\pi }: \varOmega \rightarrow \tilde{\varOmega }_{\pi },\ x \mapsto \mathop {\mathrm {argmax}\,}_{\tau \in \tilde{\varOmega }_{\pi }} \left\{ \sum _{\eta \le \tau } \tilde{P_{\pi }}(\eta ) \le x \right\} . \end{aligned}$$
(47)

Now \(S_{\pi }\) basically divides the interval [0, 1] into finitely many disjoint subintervals, such that the i-th subinterval \(I_i\) maps to the i-th element \(\tau _i \in \tilde{\varOmega }_{\pi }\), and additionally the length of \(I_i\) is given by \(\tilde{P_{\pi }}(\tau _i)\). \(S_{\pi }\) is measurable, because the pre-image of any element of the sigma-algebra \(\tilde{\mathfrak {A}_{\pi }}\) wrt. \(S_{\pi }\) is just a finite union of subintervals of [0, 1], which is clearly contained in the Borel-algebra. Basically \(S_{\pi }\) just describes how to get one sample from a multinomial distribution with (finitely many) probabilities \(\tilde{P_{\pi }}(\tau )\), where \(\tau \in \tilde{\varOmega }_{\pi }\). Compare with inverse transform sampling, e.g. Theorem 2.1.10. in [9] and applications thereof. For the reader’s convenience let us briefly recall this important concept here in a formal way:

Lemma 1 (Inverse transform sampling)

Let X have continuous cumulative distribution \(F_X(x)\) and define the random variable Y as \(Y=F_{X}(X)\). Then Y is uniformly distributed on (0, 1).
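
The mechanism behind \(S_{\pi }\) in Eq. (47) and Lemma 1 can be made concrete with a few lines of code: given the finitely many trajectory probabilities \(\tilde{P_{\pi }}(\tau )\), a uniform sample \(x \in [0,1]\) is mapped to the trajectory whose subinterval contains x. The trajectory labels and probabilities below are hypothetical stand-ins used only for illustration.

```python
import numpy as np

def make_sampler(trajectories, probs):
    """Build S_pi of Eq. (47): partition [0, 1] into consecutive subintervals I_i of
    length P_pi(tau_i) and map x in [0, 1] to the trajectory whose interval contains x."""
    cdf = np.cumsum(probs)                       # right endpoints of the subintervals I_i
    assert np.isclose(cdf[-1], 1.0)
    def S_pi(x):
        i = int(np.searchsorted(cdf, x, side="left"))
        return trajectories[min(i, len(trajectories) - 1)]
    return S_pi

# Hypothetical finite trajectory space and probabilities (assumptions of this sketch):
trajs, probs = ["tau_1", "tau_2", "tau_3"], [0.2, 0.5, 0.3]
S_pi = make_sampler(trajs, probs)
rng = np.random.default_rng(0)
samples = [S_pi(rng.uniform()) for _ in range(10_000)]
# The empirical frequencies approximate P_pi, in line with inverse transform sampling:
print({t: samples.count(t) / len(samples) for t in trajs})
```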

1.6 A.6 Further Details for Proof of Theorem 1

Here we provide further technical details needed to verify the assumptions stated above and thus to prove our main result, Theorem 1.

Ad (A1): Assumptions on the Controlled Markov Processes: Let us start by discussing more details of the controlled processes that appear in the PPO and RUDDER settings. Let us focus on the process related to \((z_1)_n\): let \(\beta >1\) and let the real sequence \((z_1)_n\) be defined by \((z_1)_1=1\) and \((z_1)_{n+1}=(1-\frac{1}{\beta })(z_1)_{n}+1\). The \((z_1)_n\) are nothing but the partial sums of a geometric series converging to \(\beta \).

The sequence \((z_1)_n\) can also be interpreted as a time-homogeneous Markov process \((\boldsymbol{z}_1)_n\) with transition probabilities given by

$$\begin{aligned} P(z, y)=\delta _{(1-\frac{1}{\beta })z+1}, \end{aligned}$$
(48)

where \(\delta \) denotes the Dirac measure, and with the compact interval \([1,\beta ]\) as its range. We use the standard notation for discrete time Markov processes, described in detail e.g. in [13]. Its unique invariant measure is clearly \(\delta _{\beta }\). So integrating wrt. this invariant measure will in our case just correspond to taking the limit \((z_1)_n \rightarrow \beta \).
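
For illustration, the convergence of this control sequence to \(\beta \) can be checked in a few lines; the snippet below merely reproduces the recursion defined above together with its closed form (the concrete value of \(\beta \) is arbitrary).

```python
beta, n_steps = 5.0, 60
z = 1.0                                   # (z_1)_1 = 1
for n in range(1, n_steps):
    z = (1.0 - 1.0 / beta) * z + 1.0      # (z_1)_{n+1} = (1 - 1/beta) (z_1)_n + 1
# z is now (z_1)_{n_steps}, i.e. the partial sum beta * (1 - (1 - 1/beta)**n_steps),
# which is already very close to the limit beta:
print(z, beta * (1.0 - (1.0 - 1.0 / beta) ** n_steps), beta)
```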

Ad (A2): \(\boldsymbol{h}\) and \(\boldsymbol{f}\) are Lipschitz: By the mean value theorem it is enough to show that the derivatives wrt. \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) are bounded uniformly wrt. \(\boldsymbol{z}\). We only show details for f, since similar considerations apply to h. By the explicit formula for \(L_g\), we see that \(f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\) can be written as:

$$\begin{aligned} \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}}&\prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \pi (a_t \mid s_t, \boldsymbol{\theta },\boldsymbol{z}) \nabla _{\boldsymbol{\omega }}\varPhi (g(\tau ; \boldsymbol{\omega }, \boldsymbol{z}) ,\tau , \boldsymbol{\theta }, \boldsymbol{\omega }, \boldsymbol{z}) . \end{aligned}$$
(49)

The claim can now be readily deduced from the assumptions (L1)–(L3).

Ad (A3): Martingale Difference Property and Estimates: From the results in the main paper on the probabilistic setting, \((\boldsymbol{m}_1)_{n+1}\) and \((\boldsymbol{m}_2)_{n+1}\) can easily be seen to be martingale difference sequences with respect to their filtrations \(\mathfrak {F}_n\). Indeed, the sigma-algebras generated by \(\boldsymbol{\omega }_n\) and \(\boldsymbol{\theta }_n\) already determine \(\tilde{\mathfrak {A}}_{\pi _{\boldsymbol{\theta }_n}}\), and thus:

$$\begin{aligned} \mathbf {\mathrm {E}}[(\boldsymbol{m}_i)_{n+1}|\mathfrak {F}_n]=\mathbf {\mathrm {E}}[\hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)|\mathfrak {F}_n]-\mathbf {\mathrm {E}}[f(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)]=0. \end{aligned}$$
(50)

It remains to show that

$$\begin{aligned} \mathbf {\mathrm {E}}[||(\boldsymbol{m}_i)_{n+1}||^2 | \mathfrak {F}_n] \le B_i \text { for }i=1,2. \end{aligned}$$
(51)

This, however, is also clear, since all the involved expressions are again bounded uniformly by the assumptions (L1)–(L3) on the losses (e.g. one can see this by writing down the involved expressions explicitly, as indicated in the previous point (A2)).

Ad (A4): Assumptions on the Learning Rates: These standard assumptions are taken for granted.

Ad (A5): Transition Kernels: The continuity of the transition kernels is clear from Eq. (48) (continuity is wrt. the weak topology on the space of probability measures, so in our case this again boils down to using continuity of the test functions).

Ad (A6): Stability Properties of the ODEs:

  • By the explanations for (A1) we mentioned that integrating wrt. the ergodic occupation measure in our case corresponds to taking the limit \(\boldsymbol{z}_n \rightarrow \boldsymbol{z}\) (since our Markov processes can be interpreted as sequences). Thus \(\tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\). In the sequel we will also use the following abbreviations: \(f(\boldsymbol{\theta },\boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), \(h(\boldsymbol{\theta },\boldsymbol{\omega })=h(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), etc. Now consider the ODE

    $$\begin{aligned} \dot{\boldsymbol{\omega }}(t)=f(\boldsymbol{\theta },\boldsymbol{\omega }(t)), \end{aligned}$$
    (52)

    where \(\boldsymbol{\theta }\) is fixed. Equation (52) can be seen as a gradient system for the function \(L_g\). By standard results on gradient systems (cf. e.g. Sect. 4 in [1] for a nice summary), which guarantee equivalence between strict local minima of the loss function and asymptotically stable points of the associated gradient system, we can use the assumptions (L1)–(L3) and the remarks thereafter from the main paper to ensure that there exists a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) of Eq. (52).

  • The fact that \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is smooth enough can be deduced by the Implicit Function Theorem as discussed in the main paper.

  • For Eq. (52) \(L_g(\boldsymbol{\theta },\boldsymbol{\omega })-L_g(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }))\) can be taken as associated Lyapunov function \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\), and thus \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\) clearly is differentiable wrt. \(\boldsymbol{\omega }\) for any \(\boldsymbol{\theta }\).

  • The slow ODE \( \dot{\boldsymbol{\theta }}(t)=h(\boldsymbol{\theta }(t),\boldsymbol{\lambda }(\boldsymbol{\theta }(t))) \) also has a unique asymptotically stable fixed point, which again is guaranteed by our assumptions and the standard results on gradient systems.

Ad (A7): Assumption of Bounded Iterates: This follows from the assumptions on the loss functions.

1.7 A.7 Finite Greediness is Sufficient to Converge to the Optimal Policy

Here we provide details on how the optimal policy can be deduced using only a finite parameter \(\beta >1\). The Q-values for policy \(\pi \) are:

$$\begin{aligned} q^{\pi }(s_t,a_t)&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{\tau =t}^{T} R_{\tau +1} \mid s_t,a_t \right] \nonumber \\&= \ \sum _{\begin{array}{c} s_t,..,s_T \\ a_t,...,a_T \end{array}} \prod _{\tau =t}^{T-1} p(s_{\tau +1} \mid s_{\tau },a_{\tau }) \ \prod _{\tau =t}^{T} \pi (a_{\tau } \mid s_{\tau }) \ \sum _{\tau =t}^{T} R_{\tau +1}. \end{aligned}$$
(53)

The optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Then we are going to show the following result:

Lemma 2

Let \(i_{\max } = \arg \max _{i} q^{\pi ^*}(s,a^i)\) and \(v^{\pi ^*}(s) = \max _{i} q^{\pi ^*}(s,a^i)\). We define

$$\begin{aligned} 0&< \ \epsilon \ < \ \min _{s,i\not =i_{\max }} (v^{\pi ^*}(s) \ - \ q^{\pi ^*}(s,a^i)), \end{aligned}$$
(54)

We assume a function \(\psi (s,a^i)\) that defines the actual policy \(\pi \) via

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ \psi (s,a^i) ) }{\sum _j \exp (\beta \ \psi (s,a^j) )}. \end{aligned}$$
(55)

We assume that the function \(\psi \) has already identified the optimal actions, which will occur at some time point during learning as the policy becomes greedier:

$$\begin{aligned} 0&< \ \delta \ < \ \min _{s,i\not =i_{\max }} (\psi (s,a^{i_{\max }}) \ - \psi (s,a^i) ). \end{aligned}$$
(56)

Hence,

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \pi (a^i \mid s; \beta )&= \ \pi ^*(a^i \mid s). \end{aligned}$$
(57)

We assume that

$$\begin{aligned} \beta&> \nonumber \\&\max \left( \frac{\log ({{\left| \mathscr {A} \right| }}-1)}{\delta }, -\log \left( \frac{\epsilon }{2\,T \ (\left| \mathscr {A}\right| - 1) \ |\mathscr {S}|^T \ |\mathscr {A}|^T \ (T+1) \ K_R} \right) / \delta \ \right) . \end{aligned}$$
(58)

Then the following statement holds for all s:

$$\begin{aligned} \forall _{j,j \not =i}: \ q^{\pi }(s,a^i)&> \ q^{\pi }(s,a^j) \ \Rightarrow \ i = i_{\max }, \end{aligned}$$
(59)

therefore the Q-values \(q^{\pi }(s,a^i)\) determine the optimal policy, since the action with the largest Q-value can be chosen.

More importantly, \(\beta \) is large enough to allow Q-value-based methods and policy gradients to converge to the optimal policy if it is the local minimum of the loss functions. For Q-value-based methods the optimal action can be determined if the optimal policy is the minimum of the loss functions. For policy gradients the optimal action always receives the largest gradient and the policy converges to the optimal policy.

Proof

We already discussed that the optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Since

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ (\psi (s,a^i) \ - \ \psi (s,a^{i_{\max }})) ) }{\sum _j \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))}, \end{aligned}$$
(60)

we have

$$\begin{aligned} \pi (a^{i_{\max }} \mid s; \beta )&= \ \frac{1}{1 \ + \ \sum _{j,j\not =i_{\max }} \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))} \nonumber \\&> \frac{1}{1 \ + \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ) } \nonumber \\&= \ 1 \ - \ \frac{(|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta )}{1 \ + \ (|\mathscr {A}|-1) \ \exp (- \beta \ \delta ) } \nonumber \\&> 1 \ - \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ) \end{aligned}$$
(61)

and for \(i \not = i_{\max }\)

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ (\psi (s,a^i) \ - \ \psi (s,a^{i_{\max }})) ) }{1 \ + \ \sum _{j,j\not =i_{\max }} \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))} \nonumber \\&< \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(62)

For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 1\), we have

$$\begin{aligned} \prod _{t=1}^{T} \pi (a_t \mid s_t)&> \ (1 \ - \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ))^T \nonumber \\&> \ 1 - \ T \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ), \end{aligned}$$
(63)

where in the last step we used that \((|\mathscr {A}|-1) \exp (- \beta \delta )<1\) by definition of \(\beta \) in (58) so that an application of Bernoulli’s inequality is justified. For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 0\), we have

$$\begin{aligned} \prod _{t=1}^{T} \pi (a_t \mid s_t)&< \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(64)

Therefore

$$\begin{aligned} {{\left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| }}&< \ T \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(65)

Using Eq. (65) and the definition of \(\beta \) in Eq. (58) we get:

$$\begin{aligned}&{{\left| q^{\pi ^*}(s,a^i) \ - \ q^{\pi }(s,a^i) \right| }} \nonumber \\&= \ \left| \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \left( \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right) \ \sum _{t=0}^{T} R_{t+1} \right| \nonumber \\&< \ \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| \ (T+1) \ K_R\nonumber \\&< \ \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| \ (T+1) \ K_R\nonumber \\&< \ |\mathscr {S}|^T \ |\mathscr {A}|^T \ \frac{\epsilon }{ 2|\mathscr {S}|^T \ |\mathscr {A}|^T \ (T+1) \ K_R} \ (T+1) \ K_R \ = \ \epsilon / 2. \end{aligned}$$
(66)

Now from the condition that \(q^{\pi }(s,a^i) \ > \ q^{\pi }(s,a^j) \) for all \(j \ne i\) we can conclude that

$$\begin{aligned} \begin{array}{c} q^{\pi ^*}(s,a^j) - q^{\pi ^*}(s,a^i) \\< (q^{\pi }(s,a^j) + \epsilon / 2) - (q^{\pi }(s,a^i) - \epsilon / 2) < \epsilon \end{array} \end{aligned}$$
(67)

for all \(j \ne i\). Thus for \(j \not =i\) it follows that \(j \not =i_{\max }\) and consequently \(i=i_{\max }\).    \(\square \)

Copyright information

© 2021 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Holzleitner, M., Gruber, L., Arjona-Medina, J., Brandstetter, J., Hochreiter, S. (2021). Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVIII. Lecture Notes in Computer Science(), vol 12670. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-63519-3_5

  • DOI: https://doi.org/10.1007/978-3-662-63519-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-63518-6

  • Online ISBN: 978-3-662-63519-3
