Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

Abstract

We prove under commonly used assumptions the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory.

Previous convergence proofs assume linear function approximation, cannot treat episodic examples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.

References

  1. Absil, P.A., Kurdyka, K.: On the stable equilibrium points of gradient systems. Syst. Control Lett. 55(7), 573–577 (2006)

  2. Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: Return decomposition for delayed rewards (2018). ArXiv https://arxiv.org/abs/1806.07857

  3. Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: return decomposition for delayed rewards. In: Advances in Neural Information Processing Systems, vol. 33 (2019). ArXiv https://arxiv.org/abs/1806.07857

  4. Bakker, B.: Reinforcement learning by backpropagation through an LSTM model/critic. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 127–134 (2007). https://doi.org/10.1109/ADPRL.2007.368179

  5. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)

  6. Bhatnagar, S., Prasad, H.L., Prashanth, L.A.: Stochastic Recursive Algorithms for Optimization. Lecture Notes in Control and Information Sciences, 1st edn., p. 302. Springer, London (2013). https://doi.org/10.1007/978-1-4471-4285-0

  7. Borkar, V.S.: Stochastic Approximation. TRM, vol. 48. Hindustan Book Agency, Gurgaon (2008). https://doi.org/10.1007/978-93-86279-38-5

  8. Borkar, V.S., Meyn, S.P.: The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2), 447–469 (2000). https://doi.org/10.1137/S0363012997331639

  9. Casella, G., Berger, R.L.: Statistical Inference. Wadsworth and Brooks/Cole, Stanley (2002)

  10. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 192–204 (2015)

  11. Dayan, P.: The convergence of TD(\(\lambda \)) for general \(\lambda \). Mach. Learn. 8, 341 (1992)

  12. Fan, J., Wang, Z., Xie, Y., Yang, Z.: A theoretical analysis of deep \(q\)-learning. CoRR abs/1901.00137 (2020)

  13. Hairer, M.: Ergodic properties of Markov processes. In: Lecture Notes (2018)

  14. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. pp. 6626–6637. Curran Associates, Inc. (2017). Preprint arXiv:1706.08500

  15. Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv:1902.00618 (2019)

  16. Karmakar, P., Bhatnagar, S.: Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res. (2017). https://doi.org/10.1287/moor.2017.0855

  17. Kawaguchi, K.: Deep learning without poor local minima. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. pp. 586–594 (2016)

  18. Kawaguchi, K., Bengio, Y.: Depth with nonlinearity creates no bad local minima in ResNets. Neural Netw. 118, 167–174 (2019)

  19. Kawaguchi, K., Huang, J., Kaelbling, L.P.: Effect of depth and width on local minima in deep learning. Neural Comput. 31(6), 1462–1498 (2019)

  20. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv:1710.05468 (2017)

  21. Konda, V.R., Borkar, V.S.: Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Control Optim. 38(1), 94–123 (1999). https://doi.org/10.1137/S036301299731669X

  22. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  23. Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM J. Control Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S0363012901385691

  24. Kushner, H.J., Clark, D.S.: Stochastic Approximation Methods for Constrained and Unconstrained Systems. Applied Mathematical Sciences. Springer, New York (1978). https://doi.org/10.1007/978-1-4684-9352-8

  25. Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer, New York (2003). https://doi.org/10.1007/b97441

  26. Lin, T., Jin, C., Jordan, M.I.: On gradient descent ascent for nonconvex-concave minimax problems. arXiv:1906.00331 (2019)

  27. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. In: Advances in Neural Information Processing Systems, vol. 33. arXiv:1906.10306 (2019)

  28. Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22. pp. 1204–1212. Curran Associates, Inc. (2009)

  29. Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv:1901.00838 (2019)

  30. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. In: Advances in Neural Information Processing Systems, vol. 34 (2020). arXiv:2006.11144

  31. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv:1312.5602 (2013)

  32. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236

  33. Munro, P.W.: A dual back-propagation scheme for scalar reinforcement learning. In: Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pp. 165–176 (1987)

  34. OpenAI, et al.: Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680 (2019)

  35. Patil, V.P., et al.: Align-RUDDER: learning from few demonstrations by reward redistribution. arXiv:2009.14108 (2020)

  36. Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)

  37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586

  38. Robinson, A.J.: Dynamic error propagation networks. Ph.D. thesis, Trinity Hall and Cambridge University Engineering Department (1989)

  39. Robinson, T., Fallside, F.: Dynamic reinforcement driven error propagation networks with application to game playing. In: Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pp. 836–843 (1989)

  40. Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P.: Trust region policy optimization. arXiv:1502.05477 (2015). 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 37

  41. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2018)

  42. Singh, S., Jaakkola, T., Littman, M., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38, 287–308 (2000). https://doi.org/10.1023/A:1007678930559

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  44. Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)

  45. Tsitsiklis, J.N.: Asynchronous stochastic approximation and \(q\)-learning. Mach. Learn. 16(3), 185–202 (1994). https://doi.org/10.1023/A:1022689125041

  46. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019). https://doi.org/10.1038/s41586-019-1724-z

  47. Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992)

  48. Xu, T., Zou, S., Liang, Y.: Two time-scale off-policy TD learning: non-asymptotic analysis over Markovian samples. Adv. Neural Inf. Process. Syst. 32, 10633–10643 (2019)

  49. Yang, Z., Chen, Y., Hong, M., Wang, Z.: Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. Adv. Neural Inf. Process. Syst. 32, 8351–8363 (2019)

Acknowledgments

The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State Upper Austria. IARAI is supported by Here Technologies. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), Medical Cognitive Computing Center (MC3), PRIMAL (FFG873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3 ID: 951847), AIDD (MSCA-ITN-2020 ID: 956832). We thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google Brain, ZF Friedrichshafen AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA Corporation.

A Appendix

This appendix provides the reader with details and more precise descriptions of several parts of the main text, e.g. exact formulations of the algorithms and more technical proof steps. Sections A.1 and A.2 provide the full formulations of the PPO and RUDDER algorithms, respectively, for which we ensure convergence. Section A.3 describes how the causality assumption leads to the formulas for PPO. In Sect. A.4 we discuss the precise formulations of the assumptions from [16]. Section A.5 gives further details about the probabilistic setup that we use to formalize the sampling process, while Sect. A.6 gives formal details on how to verify the assumptions from [16] and obtain our main convergence result, Theorem 1. The last Sect. A.7 discusses how the optimal policy can be deduced from the approximate ones.

1.1 A.1 Further Details on PPO

Here we describe the minimization problem for the PPO setup in more detail by including the exact expressions for the gradients of the respective loss functions:

$$\begin{aligned}&\mathrm {L}_h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) = \ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n) } \left[ - \ G_0 \ + \ (z_2)_n \ \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] , \end{aligned}$$
(15)
$$\begin{aligned}&h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\sum _{t=0}^T \nabla _{\boldsymbol{\theta }} \log \pi (a_t \mid s_t ; \boldsymbol{\theta }_n,\boldsymbol{z}_n) \ ( \hat{q}^{\pi }(s_{t},a_{t};\boldsymbol{\omega }_n) - \hat{v}^\pi (s_t;\boldsymbol{\omega }_n) ) \right. \nonumber \\&+ (z_2)_n \ \sum _{t=0}^T \nabla _{\boldsymbol{\theta }_n} \log \pi (a_t \mid s_t ; \boldsymbol{\theta }_n,\boldsymbol{z}_n) \ \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) + \ (z_2)_n \nabla _{\boldsymbol{\theta }_n} \rho (\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \Bigg ] ,\end{aligned}$$
(16)
$$\begin{aligned}&\mathrm {L}^\mathrm {TD}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \big ( \delta ^{\mathrm {TD}}(t) \big )^2 \right] , \end{aligned}$$
(17)
$$\begin{aligned}&f^\mathrm {TD}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ - \sum _{t=0}^{T} \delta ^{\mathrm {TD}}(t) \ \nabla _{\boldsymbol{\omega }_n} \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \right] , \end{aligned}$$
(18)
$$\begin{aligned}&\mathrm {L}^\mathrm {MC}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \bigg ( G_t \ - \ \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \bigg )^2 \right] , \end{aligned}$$
(19)
$$\begin{aligned}&f^\mathrm {MC}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \ \nonumber \\&\ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\sum _{t=0}^{T} \bigg ( G_t \ - \ \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \bigg ) \ \nabla _{\boldsymbol{\omega }_n} \hat{q}^{\pi }(s_t,a_t;\boldsymbol{\omega }_n) \right] , \end{aligned}$$
(20)
$$\begin{aligned}&\boldsymbol{\theta }_{n+1} \ = \ \boldsymbol{\theta }_n \ - \ a(n) \ \hat{h} (\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n), \boldsymbol{\omega }_{n+1} \ = \ \boldsymbol{\omega }_n \ - \ b(n) \ \hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n). \end{aligned}$$
(21)
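
The coupled updates in Eq. (21) can be illustrated by a short, schematic Python sketch of the two time-scale structure: the critic parameters \(\boldsymbol{\omega }\) move with the faster step size b(n), the actor parameters \(\boldsymbol{\theta }\) with the slower step size a(n) = o(b(n)), and the control variable \((z_1)_n\) (cf. Sect. A.6, Ad (A1)) slowly increases the greediness towards \(\beta \). The functions h_hat and f_hat below are placeholders, i.e. assumptions of this sketch, for the stochastic gradient estimates \(\hat{h}\) and \(\hat{f}\) computed from sampled episodes; they are not part of the paper.

```python
import numpy as np

# Placeholders (assumptions of this sketch) for the stochastic gradient estimates
# \hat{h} and \hat{f} of Eq. (21); for PPO they would be Monte Carlo estimates of
# Eqs. (16) and (18) (or (20)) computed from trajectories sampled under pi(theta, z).
def h_hat(theta, omega, z1):
    return np.zeros_like(theta)           # actor gradient estimate (placeholder)

def f_hat(theta, omega, z1):
    return np.zeros_like(omega)           # critic gradient estimate (placeholder)

def two_timescale_iteration(theta, omega, beta=10.0, n_iter=10_000):
    """Sketch of the coupled updates in Eq. (21): the critic (omega) is updated on the
    faster time scale b(n), the actor (theta) on the slower time scale a(n) = o(b(n))."""
    z1 = 1.0                               # greediness control (z_1)_n, cf. Ad (A1)
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n                      # slow step size
        b_n = 1.0 / n ** (2.0 / 3.0)       # fast step size, a(n)/b(n) -> 0
        theta = theta - a_n * h_hat(theta, omega, z1)
        omega = omega - b_n * f_hat(theta, omega, z1)
        z1 = (1.0 - 1.0 / beta) * z1 + 1.0  # converges monotonically to beta
    return theta, omega

theta, omega = two_timescale_iteration(np.zeros(4), np.zeros(6))
```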

1.2 A.2 Further Details on RUDDER

In a similar vein we present the minimization problem of RUDDER in more detail:

$$\begin{aligned}&\mathrm {L}_h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n,\boldsymbol{z}_n)=\nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \breve{\pi }} \left[ \frac{1}{2} \ \sum _{t=0}^{T} \bigg ( R_{t+1}(\tau ; \boldsymbol{\omega }_n) - \hat{q}(s_t, a_t; \boldsymbol{\theta }_n)\bigg )^2 \ + \ (z_2)_n \ \rho _{\boldsymbol{\theta }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] \end{aligned}$$
(22)
$$\begin{aligned}&h(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n,\boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \breve{\pi }} \left[ -\sum _{t=0}^{T} \bigg ( R_{t+1}(\tau ; \boldsymbol{\omega }_n) - \hat{q}(s_t, a_t; \boldsymbol{\theta }_n)\bigg ) \ \nabla _{\boldsymbol{\theta }} \hat{q}(s_t, a_t; \boldsymbol{\theta }_n) \right. \Bigg . + (z_2)_n \ \nabla _{\boldsymbol{\theta }} \rho _{\boldsymbol{\theta }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \ \Bigg ] \end{aligned}$$
(23)
$$\begin{aligned}&\mathrm {L}_g(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)= \nonumber \\&\mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ \frac{1}{2} \ \bigg ( \sum _{t=0}^{T} \tilde{R}_{t+1} \ - \ g( \tau ; \boldsymbol{\omega }_n ) \bigg )^2 \ + \ (z_2)_n \ \rho _{\boldsymbol{\omega }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \right] \end{aligned}$$
(24)
$$\begin{aligned}&f(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n) \ = \ \mathbf {\mathrm {E}}_{\tau \sim \pi (\boldsymbol{\theta }_n,\boldsymbol{z}_n)} \left[ -\bigg ( \sum _{t=0}^{T} \tilde{R}_{t+1} \ - \ g( \tau ; \boldsymbol{\omega }_n ) \bigg ) \ \nabla _{\boldsymbol{\omega }} g( \tau ; \boldsymbol{\omega }_n ) \ \right. \nonumber \\&\Bigg . +(z_2)_n \ \nabla _{\boldsymbol{\omega }} \rho _{\boldsymbol{\omega }}(\tau ,\boldsymbol{\theta }_n,\boldsymbol{z}_n) \Bigg ], \end{aligned}$$
(25)
$$\begin{aligned}&\boldsymbol{\theta }_{n+1} \ = \ \boldsymbol{\theta }_n \ - \ a(n) \ \hat{h} (\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n), \boldsymbol{\omega }_{n+1} \ = \ \boldsymbol{\omega }_n \ - \ b(n) \ \hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n). \end{aligned}$$
(26)
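
For RUDDER, the quantity that drives the update of \(\hat{q}(s_t, a_t; \boldsymbol{\theta })\) in Eqs. (22)–(23) is the redistributed reward \(R_{t+1}(\tau ; \boldsymbol{\omega })\). The following sketch only illustrates the return-decomposition idea behind this quantity, namely redistributing the predicted return as differences of consecutive predictions of the return model g, as proposed in [2, 3]; the toy featurization and the linear return model are assumptions of this sketch, not the construction used in the paper.

```python
import numpy as np

def featurize(prefix):
    # Toy featurization of a state-action-reward prefix (an assumption of this sketch).
    return np.array([len(prefix), sum(r for (_, _, r) in prefix)], dtype=float)

def return_model_g(prefix, omega):
    """Placeholder for g(tau_{0,t}; omega): a model (an LSTM in RUDDER) that predicts
    the final return of the episode from the prefix tau_{0,t}; here a linear toy model."""
    return float(np.dot(omega, featurize(prefix)))

def redistributed_rewards(trajectory, omega):
    """Redistribute the predicted return as differences of consecutive predictions,
    R_{t+1}(tau; omega) = g(tau_{0,t}; omega) - g(tau_{0,t-1}; omega), the
    return-decomposition choice of RUDDER [2, 3] entering Eq. (22)."""
    rewards, prev = [], 0.0
    for t in range(len(trajectory)):
        pred = return_model_g(trajectory[: t + 1], omega)
        rewards.append(pred - prev)       # telescoping: the rewards sum to g(tau; omega)
        prev = pred
    return rewards

# Usage with a hypothetical trajectory of (state, action, reward) triples:
tau = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]
print(redistributed_rewards(tau, omega=np.array([0.1, 1.0])))
```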

1.3 A.3 Causality and Reward-To-Go

This section provides the reader with more details concerning the causality assumption that leads to the formula for h in Eq. (16) for PPO. We derive a formulation of the policy gradient with reward-to-go. For ease of notation, instead of using \(\tilde{P_{\pi }}(\tau )\) as in previous sections, we here denote the probability of the state-action sequence \(\tau =\tau _{0,T}=(s_0,a_0,s_1,a_1,\ldots ,s_T,a_T)\) under policy \(\pi \) as

$$\begin{aligned}&p(\tau ) \ = \ p(s_0) \ \pi (a_0 \mid s_0) \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \pi (a_t \mid s_t) \nonumber \\&= \ p(s_0) \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \prod _{t=0}^{T} \pi (a_t \mid s_t). \end{aligned}$$
(27)

The probability of state-action sequence \(\tau _{0,t}=(s_0,a_0,s_1,a_1,\ldots ,s_t,a_t)\) with policy \(\pi \) is

$$\begin{aligned}&p(\tau _{0,t}) \ = \ p(s_0) \ \pi (a_0 \mid s_0) \ \prod _{k=1}^{t} p(s_k \mid s_{k-1},a_{k-1}) \ \pi (a_k \mid s_k) \nonumber \\&= \ p(s_0) \ \prod _{k=1}^{t} p(s_k \mid s_{k-1},a_{k-1}) \ \prod _{k=0}^{t} \pi (a_k \mid s_k). \end{aligned}$$
(28)

The probability of state-action sequence \(\tau _{t+1,T}=(s_{t+1},a_{t+1},\ldots ,s_T,a_T)\) with policy \(\pi \) given \(( s_t,a_t)\) is

$$\begin{aligned}&p(\tau _{t+1,T} \mid s_t,a_t) \ = \ \prod _{k=t+1}^{T} p(s_k \mid s_{k-1},a_{k-1}) \ \pi (a_k \mid s_k) \nonumber \\&= \ \prod _{k=t+1}^{T} p(s_k \mid s_{k-1},a_{k-1}) \ \prod _{k=t+1}^{T} \pi (a_k \mid s_k). \end{aligned}$$
(29)

The expectation of \(\sum _{t=0}^{T} R_{t+1}\) is

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \sum _{t=0}^{T} \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] . \end{aligned}$$
(30)

With \(R_{t+1} \sim p(r_{t+1} \mid s_t,a_t)\), the random variable \(R_{t+1}\) depends only on \((s_t,a_t)\). We define the expected reward \(\mathbf {\mathrm {E}}_{r_{t+1}} \left[ R_{t+1} \mid s_t,a_t\right] \) as a function \(r(s_t,a_t)\) of \((s_t,a_t)\):

$$\begin{aligned} r(s_t,a_t)&:= \ \mathbf {\mathrm {E}}_{r_{t+1}} \left[ R_{t+1} \mid s_t,a_t\right] \ = \ \sum _{r_{t+1}} p(r_{t+1} \mid s_t,a_t) \ r_{t+1}. \end{aligned}$$
(31)

Causality. We assume that the reward \(R_{t+1}=R(s_t,a_t) \sim p(r_{t+1} \mid s_t,a_t)\) only depends on the past but not on the future. The state-action pair \((s_t,a_t)\) is determined by the past and not by the future. What matters is only how likely we are to observe \((s_t,a_t)\), not what we do afterwards.

Causality is derived from the Markov property of the MDP and means:

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\tau _{0,t} \sim \pi } \left[ R_{t+1} \right] . \end{aligned}$$
(32)

That is

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ R_{t+1} \right] \ = \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_T} \sum _{a_T} p(\tau ) \ r(s_t,a_t) \nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_T} \sum _{a_T} \ \prod _{l=1}^{T} p(s_l \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{T} \pi (a_l \mid s_l) \ r(s_t,a_t)\nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_t} \sum _{a_t} \ \prod _{l=1}^{t} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{t} \pi (a_{l} \mid s_{l}) \ r(s_t,a_t)\nonumber \\&~~~\sum _{s_{t+1}} \sum _{a_{t+1}} \sum _{s_{t+2}} \sum _{a_{t+2}} \ \ldots \ \sum _{s_T} \sum _{a_T} \ \prod _{l=t+1}^{T} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=t+1}^{T} \pi (a_{l} \mid s_{l})\nonumber \\&= \ \sum _{s_1} \sum _{a_1} \sum _{s_2} \sum _{a_2} \ \ldots \ \sum _{s_t} \sum _{a_t} \ \prod _{l=1}^{t} p(s_{l} \mid s_{l-1},a_{l-1}) \ \prod _{l=1}^{t} \pi (a_{l} \mid s_{l}) \ r(s_t,a_t) \nonumber \\&= \ \mathbf {\mathrm {E}}_{\tau _{0,t} \sim \pi } \left[ R_{t+1} \right] . \end{aligned}$$
(33)

Policy Gradient Theorem. We now assume that the policy \(\pi \) is parametrized by \(\boldsymbol{\theta }\), that is, \(\pi (a_t \mid s_t) = \pi (a_t \mid s_t ; \boldsymbol{\theta })\). We need the gradient with respect to \(\boldsymbol{\theta }\) of \(\prod _{t=a}^{b} \pi (a_t \mid s_t)\):

$$\begin{aligned}&\nabla _{\theta } \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ = \ \sum _{s=a}^{b} \prod _{t=a,t \not = s}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \nabla _{\theta } \pi (a_s \mid s_s ; \boldsymbol{\theta }) \nonumber \\&= \ \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \sum _{s=a}^{b} \frac{ \nabla _{\theta } \pi (a_s \mid s_s ; \boldsymbol{\theta })}{\pi (a_s \mid s_s ; \boldsymbol{\theta })}\nonumber \\&= \ \prod _{t=a}^{b} \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ \sum _{s=a}^{b} \nabla _{\theta } \log \pi (a_s \mid s_s ; \boldsymbol{\theta }). \end{aligned}$$
(34)

It follows that

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{s=1}^{t} \nabla _{\theta } \log \pi (a_s \mid s_s ; \boldsymbol{\theta }) \ R_{t+1} \right] . \end{aligned}$$
(35)

We only have to consider the reward-to-go. Since \(a_0\) does not depend on \(\pi \), we have \(\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_1 \right] =0\). Therefore

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \sum _{t=0}^{T} \nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=1}^{T} \sum _{k=1}^{t} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \sum _{t=k}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ \sum _{t=k}^{T} R_{t+1} \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] . \end{aligned}$$
(36)

We can express this in terms of Q-values:

$$\begin{aligned}&\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\tau _{0,k} \sim \pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ \mathbf {\mathrm {E}}_{\tau _{k+1,T} \sim \pi } \left[ G_k \mid s_k,a_k \right] \right] \nonumber \\&= \ \sum _{k=1}^{T} \mathbf {\mathrm {E}}_{\tau _{0,k} \sim \pi } \left[ \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] \nonumber \\&= \ \mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] . \end{aligned}$$
(37)

Finally, we have:

$$\begin{aligned}&\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\tau \sim \pi } \left[ \sum _{k=1}^{T} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ q^{\pi }(s_k,a_k) \right] . \end{aligned}$$
(38)
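
Identity (38), obtained via Eq. (36), can be checked numerically on a toy problem. The sketch below uses a hypothetical single-state MDP with two actions, horizon T, reward 1 for the first action and 0 otherwise, and a softmax policy; all of these choices are assumptions made only for this illustration. It compares a finite-difference gradient of the expected return with a Monte Carlo estimate of \(\mathbf {\mathrm {E}}_{\pi } \left[ \sum _{k} \nabla _{\theta } \log \pi (a_k \mid s_k ; \boldsymbol{\theta }) \ G_k \right] \) (in this toy check all actions, including the first one, are sampled from \(\pi \)).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 3, 100_000                  # horizon and number of sampled episodes

def pi(theta):
    """Softmax policy over two actions in a single-state toy MDP."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_return(theta):
    # reward 1 for action 0 and 0 for action 1, at each of the T steps
    return T * pi(theta)[0]

def reward_to_go_estimate(theta):
    """Monte Carlo estimate of Eq. (36): E_pi[ sum_k grad log pi(a_k) * G_k ]."""
    grad, p = np.zeros_like(theta), pi(theta)
    for _ in range(N):
        actions = rng.choice(2, size=T, p=p)
        rewards = (actions == 0).astype(float)
        G = np.cumsum(rewards[::-1])[::-1]             # reward-to-go G_k
        for k in range(T):
            score = -p.copy()
            score[actions[k]] += 1.0                   # grad_theta log pi(a_k)
            grad += score * G[k]
    return grad / N

theta, eps = np.array([0.3, -0.2]), 1e-5
fd = np.zeros(2)
for i in range(2):
    e_i = np.zeros(2); e_i[i] = 1.0
    fd[i] = (expected_return(theta + eps * e_i) - expected_return(theta - eps * e_i)) / (2 * eps)
print("finite-difference gradient:", fd)
print("reward-to-go estimate:     ", reward_to_go_estimate(theta))
```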

1.4 A.4 Precise Statement of the Assumptions

Here we provide a precise formulation of the assumptions from [16]. The formulation we use here is mostly taken from [14]:

  1. (A1)

    Assumptions on the controlled Markov processes: The controlled Markov process \(\boldsymbol{z}\) takes values in a compact metric space S. It is controlled by the iterate sequences \(\{\boldsymbol{\theta }_n\}\) and \(\{\boldsymbol{\omega }_n\}\), and furthermore \(\boldsymbol{z}_n\) is controlled by a random process \(\boldsymbol{a}_n\) taking values in a compact metric space W. For B Borel in S, the dynamics of \(\boldsymbol{z}_n\) for \(n\geqslant 0\) is determined by a transition kernel \(\tilde{p}\):

    $$\begin{aligned}&\mathrm {P}(\boldsymbol{z}_{n+1} \in B |\boldsymbol{z}_l, \boldsymbol{a}_l, \boldsymbol{\theta }_l, \boldsymbol{\omega }_l, l\leqslant n) = \ \int _{B} \tilde{p}(\mathrm {d}\boldsymbol{z}| \boldsymbol{z}_n, \boldsymbol{a}_n, \boldsymbol{\theta }_n, \boldsymbol{\omega }_n). \end{aligned}$$
    (39)
  2. (A2)

    Assumptions on the update functions: \(h : \mathbb {R}^{m+k} \times S \rightarrow \mathbb {R}^m\) is jointly continuous as well as Lipschitz in its first two arguments, uniformly w.r.t. the third. This means that for all \( \boldsymbol{z}\in S\):

    $$\begin{aligned} \Vert h(\boldsymbol{\theta }, \boldsymbol{\omega }, \boldsymbol{z}) \ - \ h(\boldsymbol{\theta }', \boldsymbol{\omega }', \boldsymbol{z})\Vert \leqslant \ L^{(1)} \ (\Vert \boldsymbol{\theta }-\boldsymbol{\theta }'\Vert + \Vert \boldsymbol{\omega }- \boldsymbol{\omega }'\Vert ). \end{aligned}$$
    (40)

    Similarly for f, where the Lipschitz constant is \(L^{(2)}\).

  3. (A3)

    Assumptions on the additive noise: For \(i=1,2\), \(\{(\boldsymbol{m}_i)_n\}\) are martingale difference sequences with bounded second moments. More precisely, \((\boldsymbol{m}_i)_n\) are martingale difference sequences w.r.t. increasing \(\sigma \)-fields

    $$\begin{aligned} \mathfrak {F}_n \ = \ \sigma (\boldsymbol{\theta }_l, \boldsymbol{\omega }_l, (\boldsymbol{m}_1)_{l}, (\boldsymbol{m}_2)_{l}, \boldsymbol{z}_l, l \leqslant n) , \end{aligned}$$
    (41)

    satisfying \( \mathrm {E}\left[ \Vert (\boldsymbol{m}_i)_n \Vert ^2 \mid \mathfrak {F}_n \right] \ \leqslant \ B_i \) for \(n \geqslant 0\) and given constants \(B_i\).

  4. (A4)

    Assumptions on the learning rates:

    $$\begin{aligned}&\sum _{n} a(n) \ = \ \infty , \quad \sum _{n} a^2(n) \ < \ \infty , \end{aligned}$$
    (42)
    $$\begin{aligned}&\sum _{n} b(n) \ = \ \infty , \quad \sum _{n} b^2(n) \ < \ \infty , \end{aligned}$$
    (43)

    and \(a(n) \ = \ \mathrm {o}(b(n))\). Furthermore, \(a(n), b(n), n \geqslant 0\) are non-increasing. A concrete example of such schedules is given directly after this list.

  5. (A5)

    Assumptions on the transition kernels: The state-action map

    $$\begin{aligned} S \times W \times \mathbb {R}^{m+k} \ni&(\boldsymbol{z},\boldsymbol{a},\boldsymbol{\theta },\boldsymbol{\omega }) \mapsto \ \tilde{p}(\mathrm {d}\boldsymbol{y}\mid \boldsymbol{z}, \boldsymbol{a}, \boldsymbol{\theta }, \boldsymbol{\omega }) \end{aligned}$$
    (44)

    is continuous (the topology on the spaces of probability measures is induced by weak convergence).

  6. (A6)

    Assumptions on the associated ODEs: We consider occupation measures, which intuitively give, for the controlled Markov process, the probability or density of observing a particular state-action pair from \(S \times W\) for given \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) and a given control. A precise definition of these occupation measures can be found e.g. on page 68 of [7] or page 5 of [16]. We make the following assumptions:

    • We assume that there exists only one such ergodic occupation measure for \(\boldsymbol{z}_n\) on \(S \times W\), denoted by \(\varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}\). A main reason for assuming uniqueness is that it enables us to deal with ODEs instead of differential inclusions. Moreover, set

      $$\begin{aligned} \tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega }) \ = \ \int f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$
      (45)
    • We assume that for \( \boldsymbol{\theta }\in \mathbb {R}^m\), the ODE \( \dot{\boldsymbol{\omega }}(t) \ = \ \tilde{f}(\boldsymbol{\theta },\boldsymbol{\omega }(t)) \) has a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) with attractor set \(B_{\boldsymbol{\theta }}\) such that \(\boldsymbol{\lambda }: \mathbb {R}^m \rightarrow \mathbb {R}^k\) is a Lipschitz map with global Lipschitz constant.

    • The Lyapunov function \(V(\boldsymbol{\theta },.)\) associated to \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is continuously differentiable.

    • Next define

      $$\begin{aligned} \tilde{h}(\boldsymbol{\theta }) \ = \ \int h(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }),\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta })}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$
      (46)

      We assume that the ODE \( \dot{\boldsymbol{\theta }}(t) \ = \ \tilde{h}(\boldsymbol{\theta }(t)) \) has a global attractor set A.

    • For all \(\boldsymbol{\theta }\), with probability 1, \(\boldsymbol{\omega }_n\) for \(n\geqslant 1\) belongs to a compact subset \(Q_{\boldsymbol{\theta }}\) of \(B_{\boldsymbol{\theta }}\) “eventually”.

    This assumption is an adapted version of (A6)’ of [16], to avoid too many technicalities (e.g. in [16] two controls are used, which we avoid here to not overload notation).

  7. (A7)

    Assumption of bounded iterates: \(\sup _n \Vert \boldsymbol{\theta }_n \Vert \ < \ \infty \) and \(\sup _n \Vert \boldsymbol{\omega }_n \Vert \ < \ \infty \) a.s.
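
As a concrete illustration of (A4), announced in the corresponding item above, a standard choice of step-size schedules (an example, not one prescribed by the paper) is

$$\begin{aligned} a(n) \ = \ \frac{1}{n+1}, \qquad b(n) \ = \ \frac{1}{(n+1)^{2/3}}. \end{aligned}$$

Both schedules are non-increasing, \(\sum _{n} a(n) = \sum _{n} b(n) = \infty \), \(\sum _{n} a^2(n) < \infty \), \(\sum _{n} b^2(n) = \sum _{n} (n+1)^{-4/3} < \infty \), and \(a(n)/b(n) = (n+1)^{-1/3} \rightarrow 0\), hence \(a(n) = \mathrm {o}(b(n))\).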

1.5 A.5 Further Details Concerning the Sampling Process

Let us formulate the construction of the sampling process in more detail: We introduced the function \(S_{\pi }\) in the main paper as follows:

$$\begin{aligned} S_{\pi }: \varOmega \rightarrow \tilde{\varOmega }_{\pi },\ x \mapsto \mathop {\mathrm {argmax}\,}_{\tau \in \tilde{\varOmega }_{\pi }} \left\{ \sum _{\eta \le \tau } \tilde{P_{\pi }}(\eta ) \le x \right\} . \end{aligned}$$
(47)

Now \(S_{\pi }\) basically divides the interval [0, 1] into finitely many disjoint subintervals, such that the i-th subinterval \(I_i\) maps to the i-th element \(\tau _i \in \tilde{\varOmega }_{\pi }\), and additionally the length of \(I_i\) is given by \(\tilde{P_{\pi }}(\tau _i)\). \(S_{\pi }\) is measurable, because the pre-image of any element of the sigma-algebra \(\tilde{\mathfrak {A}_{\pi }}\) wrt. \(S_{\pi }\) is just a finite union of subintervals of [0, 1], which is clearly contained in the Borel-algebra. Basically \(S_{\pi }\) just describes how to get one sample from a multinomial distribution with (finitely many) probabilities \(\tilde{P_{\pi }}(\tau )\), where \(\tau \in \tilde{\varOmega }_{\pi }\). Compare with inverse transform sampling, e.g. Theorem 2.1.10. in [9] and applications thereof. For the reader’s convenience let us briefly recall this important concept here in a formal way:

Lemma 1 (Inverse transform sampling)

Let X have continuous cumulative distribution \(F_X(x)\) and define the random variable Y as \(Y=F_{X}(X)\). Then Y is uniformly distributed on (0, 1).
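
The mechanism behind \(S_{\pi }\) in Eq. (47) and Lemma 1 can be made concrete with a few lines of code: given the finitely many trajectory probabilities \(\tilde{P_{\pi }}(\tau )\), a uniform sample \(x \in [0,1]\) is mapped to the trajectory whose subinterval contains x. The trajectory labels and probabilities below are hypothetical stand-ins used only for illustration.

```python
import numpy as np

def make_sampler(trajectories, probs):
    """Build S_pi of Eq. (47): partition [0, 1] into consecutive subintervals I_i of
    length P_pi(tau_i) and map x in [0, 1] to the trajectory whose interval contains x."""
    cdf = np.cumsum(probs)                       # right endpoints of the subintervals I_i
    assert np.isclose(cdf[-1], 1.0)
    def S_pi(x):
        i = int(np.searchsorted(cdf, x, side="left"))
        return trajectories[min(i, len(trajectories) - 1)]
    return S_pi

# Hypothetical finite trajectory space and probabilities (assumptions of this sketch):
trajs, probs = ["tau_1", "tau_2", "tau_3"], [0.2, 0.5, 0.3]
S_pi = make_sampler(trajs, probs)
rng = np.random.default_rng(0)
samples = [S_pi(rng.uniform()) for _ in range(10_000)]
# The empirical frequencies approximate P_pi, in line with inverse transform sampling:
print({t: samples.count(t) / len(samples) for t in trajs})
```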

1.6 A.6 Further Details for Proof of Theorem 1

Here we provide further technical details needed to verify the assumptions stated above and thus to prove our main result, Theorem 1.

Ad (A1): Assumptions on the Controlled Markov Processes: Let us start by discussing more details of the controlled processes that appear in the PPO and RUDDER settings. Let us focus on the process related to \((z_1)_n\): let \(\beta >1\) and let the real sequence \((z_1)_n\) be defined by \((z_1)_1=1\) and \((z_1)_{n+1}=(1-\frac{1}{\beta })(z_1)_{n}+1\). The \((z_1)_n\) are nothing but the partial sums of a geometric series converging to \(\beta \).

The sequence \((z_1)_n\) can also be interpreted as a time-homogeneous Markov process \((\boldsymbol{z}_1)_n\) with transition probabilities given by

$$\begin{aligned} P(z, y)=\delta _{(1-\frac{1}{\beta })z+1}, \end{aligned}$$
(48)

where \(\delta \) denotes the Dirac measure, and with the compact interval \([1,\beta ]\) as its range. We use the standard notation for discrete time Markov processes, described in detail e.g. in [13]. Its unique invariant measure is clearly \(\delta _{\beta }\). So integrating wrt. this invariant measure will in our case just correspond to taking the limit \((z_1)_n \rightarrow \beta \).
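
For illustration, the convergence of this control sequence to \(\beta \) can be checked in a few lines; the snippet below merely reproduces the recursion defined above together with its closed form (the concrete value of \(\beta \) is arbitrary).

```python
beta, n_steps = 5.0, 60
z = 1.0                                   # (z_1)_1 = 1
for n in range(1, n_steps):
    z = (1.0 - 1.0 / beta) * z + 1.0      # (z_1)_{n+1} = (1 - 1/beta) (z_1)_n + 1
# z is now (z_1)_{n_steps}, i.e. the partial sum beta * (1 - (1 - 1/beta)**n_steps),
# which is already very close to the limit beta:
print(z, beta * (1.0 - (1.0 - 1.0 / beta) ** n_steps), beta)
```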

Ad (A2): \(\boldsymbol{h}\) and \(\boldsymbol{f}\) are Lipschitz: By the mean value theorem it is enough to show that the derivatives wrt. \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) are bounded uniformly wrt. \(\boldsymbol{z}\). We only show details for f, since similar considerations apply to h. By the explicit formula for \(L_g\), we see that \(f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\) can be written as:

$$\begin{aligned} \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}}&\prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \pi (a_t \mid s_t, \boldsymbol{\theta },\boldsymbol{z}) \nabla _{\boldsymbol{\omega }}\varPhi (g(\tau ; \boldsymbol{\omega }, \boldsymbol{z}) ,\tau , \boldsymbol{\theta }, \boldsymbol{\omega }, \boldsymbol{z}) . \end{aligned}$$
(49)

The claim can now be readily deduced from the assumptions (L1)–(L3).

Ad (A3): Martingale Difference Property and Estimates: From the results in the main paper on the probabilistic setting, \((\boldsymbol{m}_1)_{n+1}\) and \((\boldsymbol{m}_2)_{n+1}\) can easily be seen to be martingale difference sequences with respect to their filtrations \(\mathfrak {F}_n\). Indeed, the sigma-algebras generated by \(\boldsymbol{\omega }_n\) and \(\boldsymbol{\theta }_n\) already determine \(\tilde{\mathfrak {A}}_{\pi _{\boldsymbol{\theta }_n}}\), and thus:

$$\begin{aligned} \mathbf {\mathrm {E}}[(\boldsymbol{m}_i)_{n+1}|\mathfrak {F}_n]=\mathbf {\mathrm {E}}[\hat{f}(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)|\mathfrak {F}_n]-\mathbf {\mathrm {E}}[f(\boldsymbol{\theta }_n, \boldsymbol{\omega }_n, \boldsymbol{z}_n)]=0. \end{aligned}$$
(50)

It remains to show that

$$\begin{aligned} \mathbf {\mathrm {E}}[||(\boldsymbol{m}_i)_{n+1}||^2 | \mathfrak {F}_n] \le B_i \text { for }i=1,2. \end{aligned}$$
(51)

This, however, is also clear, since all the involved expressions are again bounded uniformly by the assumptions (L1)–(L3) on the losses (e.g. one can see this by writing down the involved expressions explicitly, as indicated in the previous point (A2)).

Ad (A4): Assumptions on the Learning Rates: These standard assumptions are taken for granted.

Ad (A5): Transition Kernels: The continuity of the transition kernels is clear from Eq. (48) (continuity is wrt. the weak topology on the space of probability measures, so in our case this again boils down to using continuity of the test functions).

Ad (A6): Stability Properties of the ODEs:

  • By the explanations for (A1) we mentioned that integrating wrt. the ergodic occupation measure in our case corresponds to taking the limit \(\boldsymbol{z}_n \rightarrow \boldsymbol{z}\) (since our Markov processes can be interpreted as sequences). Thus \(\tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\). In the sequel we will also use the following abbreviations: \(f(\boldsymbol{\theta },\boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), \(h(\boldsymbol{\theta },\boldsymbol{\omega })=h(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), etc. Now consider the ODE

    $$\begin{aligned} \dot{\boldsymbol{\omega }}(t)=f(\boldsymbol{\theta },\boldsymbol{\omega }(t)), \end{aligned}$$
    (52)

    where \(\boldsymbol{\theta }\) is fixed. Equation (52) can be seen as a gradient system for the function \(L_g\). By standard results on gradient systems (cf. e.g. Sect. 4 in [1] for a nice summary), which guarantee equivalence between strict local minima of the loss function and asymptotically stable points of the associated gradient system, we can use the assumptions (L1)–(L3) and the remarks thereafter from the main paper to ensure that there exists a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) of Eq. (52).

  • The fact that \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is smooth enough can be deduced by the Implicit Function Theorem as discussed in the main paper.

  • For Eq. (52) \(L_g(\boldsymbol{\theta },\boldsymbol{\omega })-L_g(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }))\) can be taken as associated Lyapunov function \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\), and thus \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\) clearly is differentiable wrt. \(\boldsymbol{\omega }\) for any \(\boldsymbol{\theta }\).

  • The slow ODE \( \dot{\boldsymbol{\theta }}(t)=h(\boldsymbol{\theta }(t),\boldsymbol{\lambda }(\boldsymbol{\theta }(t))) \) also has a unique asymptotically stable fixed point, which again is guaranteed by our assumptions and the standard results on gradient systems.

Ad (A7): Assumption of Bounded Iterates: This follows from the assumptions on the loss functions.

1.7 A.7 Finite Greediness is Sufficient to Converge to the Optimal Policy

Here we provide details on how the optimal policy can be deduced using only a finite parameter \(\beta >1\). The Q-values for policy \(\pi \) are:

$$\begin{aligned} q^{\pi }(s_t,a_t)&= \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{\tau =t}^{T} R_{\tau +1} \mid s_t,a_t \right] \nonumber \\&= \ \sum _{\begin{array}{c} s_t,..,s_T \\ a_t,...,a_T \end{array}} \prod _{\tau =t}^{T-1} p(s_{\tau +1} \mid s_{\tau },a_{\tau }) \ \prod _{\tau =t}^{T} \pi (a_{\tau } \mid s_{\tau }) \ \sum _{\tau =t}^{T} R_{\tau +1}. \end{aligned}$$
(53)

The optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Then we are going to show the following result:

Lemma 2

Let \(i_{\max } = \arg \max _{i} q^{\pi ^*}(s,a^i)\) and \(v^{\pi ^*}(s) = \max _{i} q^{\pi ^*}(s,a^i)\). We define

$$\begin{aligned} 0&< \ \epsilon \ < \ \min _{s,i\not =i_{\max }} (v^{\pi ^*}(s) \ - \ q^{\pi ^*}(s,a^i)), \end{aligned}$$
(54)

We assume a function \(\psi (s,a^i)\) that defines the actual policy \(\pi \) via

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ \psi (s,a^i) ) }{\sum _j \exp (\beta \ \psi (s,a^j) )}. \end{aligned}$$
(55)

We assume that the function \(\psi \) has already identified the optimal actions, which will occur at some time point during learning as the policy becomes greedier:

$$\begin{aligned} 0&< \ \delta \ < \ \min _{s,i\not =i_{\max }} (\psi (s,a^{i_{\max }}) \ - \psi (s,a^i) ). \end{aligned}$$
(56)

Hence,

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \pi (a^i \mid s; \beta )&= \ \pi ^*(a^i \mid s). \end{aligned}$$
(57)

We assume that

$$\begin{aligned} \beta&> \nonumber \\&\max \left( \frac{\log ({{\left| \mathscr {A} \right| }}-1)}{\delta }, -\log \left( \frac{\epsilon }{2\,T \ (\left| \mathscr {A}\right| - 1) \ |\mathscr {S}|^T \ |\mathscr {A}|^T \ (T+1) \ K_R} \right) / \delta \ \right) . \end{aligned}$$
(58)

Then the following statement holds for all s:

$$\begin{aligned} \forall _{j,j \not =i}: \ q^{\pi }(s,a^i)&> \ q^{\pi }(s,a^j) \ \Rightarrow \ i = i_{\max }, \end{aligned}$$
(59)

therefore the Q-values \(q^{\pi }(s,a^i)\) determine the optimal policy, since the action with the largest Q-value can be chosen.

More importantly, \(\beta \) is large enough to allow Q-value-based methods and policy gradients to converge to the optimal policy if it is the local minimum of the loss functions. For Q-value-based methods the optimal action can be determined if the optimal policy is the minimum of the loss functions. For policy gradients the optimal action always receives the largest gradient and the policy converges to the optimal policy.

Proof

We already discussed that the optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Since

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ (\psi (s,a^i) \ - \ \psi (s,a^{i_{\max }})) ) }{\sum _j \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))}, \end{aligned}$$
(60)

we have

$$\begin{aligned} \pi (a^{i_{\max }} \mid s; \beta )&= \ \frac{1}{1 \ + \ \sum _{j,j\not =i_{\max }} \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))} \nonumber \\&> \frac{1}{1 \ + \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ) } \nonumber \\&= \ 1 \ - \ \frac{(|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta )}{1 \ + \ (|\mathscr {A}|-1) \ \exp (- \beta \ \delta ) } \nonumber \\&> 1 \ - \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ) \end{aligned}$$
(61)

and for \(i \not = i_{\max }\)

$$\begin{aligned} \pi (a^i \mid s; \beta )&= \ \frac{\exp (\beta \ (\psi (s,a^i) \ - \ \psi (s,a^{i_{\max }})) ) }{1 \ + \ \sum _{j,j\not =i_{\max }} \exp (\beta \ (\psi (s,a^j) \ - \ \psi (s,a^{i_{\max }})))} \nonumber \\&< \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(62)

For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 1\), we have

$$\begin{aligned} \prod _{t=1}^{T} \pi (a_t \mid s_t)&> \ (1 \ - \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ))^T \nonumber \\&> \ 1 - \ T \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ), \end{aligned}$$
(63)

where in the last step we used that \((|\mathscr {A}|-1) \exp (- \beta \delta )<1\) by definition of \(\beta \) in (58) so that an application of Bernoulli’s inequality is justified. For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 0\), we have

$$\begin{aligned} \prod _{t=1}^{T} \pi (a_t \mid s_t)&< \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(64)

Therefore

$$\begin{aligned} {{\left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| }}&< \ T \ (|\mathscr {A}|-1) \ \exp (- \ \beta \ \delta ). \end{aligned}$$
(65)

Using Eq. (65) and the definition of \(\beta \) in Eq. (58) we get:

$$\begin{aligned}&{{\left| q^{\pi ^*}(s,a^i) \ - \ q^{\pi }(s,a^i) \right| }} \nonumber \\&= \ \left| \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \left( \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right) \ \sum _{t=0}^{T} R_{t+1} \right| \nonumber \\&< \ \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \prod _{t=1}^{T} p(s_t \mid s_{t-1},a_{t-1}) \ \left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| \ (T+1) \ K_R\nonumber \\&< \ \sum _{\begin{array}{c} s_1,..,s_T \\ a_1,...,a_T \end{array}} \ \left| \prod _{t=1}^{T} \pi ^*(a_t \mid s_t) \ - \ \prod _{t=1}^{T} \pi (a_t \mid s_t) \right| \ (T+1) \ K_R\nonumber \\&< \ |\mathscr {S}|^T \ |\mathscr {A}|^T \ \frac{\epsilon }{ 2|\mathscr {S}|^T \ |\mathscr {A}|^T \ (T+1) \ K_R} \ (T+1) \ K_R \ = \ \epsilon / 2. \end{aligned}$$
(66)

Now from the condition that \(q^{\pi }(s,a^i) \ > \ q^{\pi }(s,a^j) \) for all \(j \ne i\) we can conclude that

$$\begin{aligned} \begin{array}{c} q^{\pi ^*}(s,a^j) - q^{\pi ^*}(s,a^i) \\< (q^{\pi }(s,a^j) + \epsilon / 2) - (q^{\pi }(s,a^i) - \epsilon / 2) < \epsilon \end{array} \end{aligned}$$
(67)

for all \(j \ne i\). Thus for \(j \not =i\) it follows that \(j \not =i_{\max }\) and consequently \(i=i_{\max }\).    \(\square \)

Copyright information

© 2021 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Holzleitner, M., Gruber, L., Arjona-Medina, J., Brandstetter, J., Hochreiter, S. (2021). Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVIII. Lecture Notes in Computer Science(), vol 12670. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-63519-3_5

  • DOI: https://doi.org/10.1007/978-3-662-63519-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-63518-6

  • Online ISBN: 978-3-662-63519-3
