Abstract
We prove, under commonly used assumptions, the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows us to show convergence of the well-known Proximal Policy Optimization (PPO) algorithm and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory.
Previous convergence proofs assume linear function approximation, cannot treat episodic examples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.
References
Absil, P.A., Kurdyka, K.: On the stable equilibrium points of gradient systems. Syst. Control Lett. 55(7), 573–577 (2006)
Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: Return decomposition for delayed rewards (2018). ArXiv https://arxiv.org/abs/1806.07857
Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: return decomposition for delayed rewards. In: Advances in Neural Information Processing Systems, vol. 32 (2019). ArXiv https://arxiv.org/abs/1806.07857
Bakker, B.: Reinforcement learning by backpropagation through an LSTM model/critic. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 127–134 (2007). https://doi.org/10.1109/ADPRL.2007.368179
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Bhatnagar, S., Prasad, H.L., Prashanth, L.A.: Stochastic Recursive Algorithms for Optimization. Lecture Notes in Control and Information Sciences, 1st edn., p. 302. Springer, London (2013). https://doi.org/10.1007/978-1-4471-4285-0
Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. TRM, vol. 48. Hindustan Book Agency, Gurgaon (2008). https://doi.org/10.1007/978-93-86279-38-5
Borkar, V.S., Meyn, S.P.: The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2), 447–469 (2000). https://doi.org/10.1137/S0363012997331639
Casella, G., Berger, R.L.: Statistical Inference. Wadsworth and Brooks/Cole, Pacific Grove (2002)
Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 192–204 (2015)
Dayan, P.: The convergence of TD(\(\lambda \)) for general \(\lambda \). Mach. Learn. 8, 341 (1992)
Fan, J., Wang, Z., Xie, Y., Yang, Z.: A theoretical analysis of deep \(q\)-learning. CoRR abs/1901.00137 (2020)
Hairer, M.: Ergodic properties of Markov processes. In: Lecture Notes (2018)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. pp. 6626–6637. Curran Associates, Inc. (2017). Preprint arXiv:1706.08500
Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv:1902.00618 (2019)
Karmakar, P., Bhatnagar, S.: Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res. (2017). https://doi.org/10.1287/moor.2017.0855
Kawaguchi, K.: Deep learning without poor local minima. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. pp. 586–594 (2016)
Kawaguchi, K., Bengio, Y.: Depth with nonlinearity creates no bad local minima in ResNets. Neural Netw. 118, 167–174 (2019)
Kawaguchi, K., Huang, J., Kaelbling, L.P.: Effect of depth and width on local minima in deep learning. Neural Comput. 31(6), 1462–1498 (2019)
Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv:1710.05468 (2017)
Konda, V.R., Borkar, V.S.: Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Control Optim. 38(1), 94–123 (1999). https://doi.org/10.1137/S036301299731669X
Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)
Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM J. Control Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S0363012901385691
Kushner, H.J., Clark, D.S.: Stochastic Approximation Methods for Constrained and Unconstrained Systems. Applied Mathematical Sciences. Springer, New York (1978). https://doi.org/10.1007/978-1-4684-9352-8
Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer, New York (2003). https://doi.org/10.1007/b97441
Lin, T., Jin, C., Jordan, M.I.: On gradient descent ascent for nonconvex-concave minimax problems. arXiv:1906.00331 (2019)
Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. In: Advances in Neural Information Processing Systems, vol. 32. arXiv:1906.10306 (2019)
Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22. pp. 1204–1212. Curran Associates, Inc. (2009)
Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv:1901.00838 (2019)
Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. In: Advances in Neural Information Processing Systems, vol. 33 (2020). arXiv:2006.11144
Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv:1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
Munro, P.W.: A dual back-propagation scheme for scalar reinforcement learning. In: Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pp. 165–176 (1987)
OpenAI: Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680 (2019)
Patil, V.P., et al.: Align-RUDDER: learning from few demonstrations by reward redistribution. arXiv:2009.14108 (2020)
Puterman, M.L.: Markov Decision Processes, 2nd edn. Wiley, Hoboken (2005)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586
Robinson, A.J.: Dynamic error propagation networks. Ph.D. thesis, Trinity Hall and Cambridge University Engineering Department (1989)
Robinson, T., Fallside, F.: Dynamic reinforcement driven error propagation networks with application to game playing. In: Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pp. 836–843 (1989)
Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P.: Trust region policy optimization. arXiv:1502.05477 (2015). 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 37
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2018)
Singh, S., Jaakkola, T., Littman, M., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38, 287–308 (2000). https://doi.org/10.1023/A:1007678930559
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
Tsitsiklis, J.N.: Asynchronous stochastic approximation and \(q\)-learning. Mach. Learn. 16(3), 185–202 (1994). https://doi.org/10.1023/A:1022689125041
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019). https://doi.org/10.1038/s41586-019-1724-z
Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992)
Xu, T., Zou, S., Liang, Y.: Two time-scale off-policy TD learning: non-asymptotic analysis over Markovian samples. Adv. Neural Inf. Process. Syst. 32, 10633–10643 (2019)
Yang, Z., Chen, Y., Hong, M., Wang, Z.: Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. Adv. Neural Inf. Process. Syst. 32, 8351–8363 (2019)
Acknowledgments
The ELLIS Unit Linz, the LIT AI Lab, the Institute for Machine Learning, are supported by the Federal State Upper Austria. IARAI is supported by Here Technologies. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), Medical Cognitive Computing Center (MC3), PRIMAL (FFG873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3 ID: 951847), AIDD (MSCA-ITN-2020 ID: 956832). We thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google Brain, ZF Friedrichshafen AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA Corporation.
A Appendix
This appendix provides the reader with details and more precise descriptions of several parts of the main text, including e.g. exact formulations of the algorithms and more technical proof steps. Sections A.1 and A.2 provide the full formulation of the PPO and RUDDER algorithms, respectively, for which we ensure convergence. Section A.3 describes how the causality assumption leads to the formulas for PPO. In Sect. A.4 we discuss the precise formulations of the assumptions from [16]. Section A.5 gives further details about the probabilistic setup that we use to formalize the sampling process, while Sect. A.6 gives formal details on how to verify the assumptions from [16] in order to obtain our main convergence result, Theorem 1. The final Sect. A.7 discusses how the optimal policy can be deduced from the approximate ones.
1.1 A.1 Further Details on PPO
Here we describe the minimization problem for the PPO setup in a more detailed way by including the exact expression for the gradients of the respective loss functions:
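To make the structure of this objective concrete, the following is a minimal code sketch of a PPO-style actor loss and critic loss, assuming the standard clipped surrogate of [41] together with a squared-error value loss. All names (`ppo_losses`, `policy`, `value_fn`) are illustrative, and the chapter's exact gradient expressions may differ in detail.

```python
import torch

def ppo_losses(policy, value_fn, states, actions, returns_to_go,
               old_log_probs, clip_eps=0.2):
    """Illustrative PPO actor/critic losses (standard clipped surrogate);
    a sketch, not necessarily the chapter's exact parametrization."""
    dist = policy(states)                       # e.g. torch.distributions.Categorical
    log_probs = dist.log_prob(actions)
    values = value_fn(states).squeeze(-1)

    # Advantage estimate from sampled returns-to-go and the critic baseline.
    advantages = (returns_to_go - values).detach()

    # Clipped surrogate objective (maximized, hence the minus sign).
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()

    # Squared-error critic loss (the fast time-scale objective in this sketch).
    critic_loss = ((values - returns_to_go) ** 2).mean()
    return actor_loss, critic_loss
```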
1.2 A.2 Further Details on RUDDER
In a similar vein we present the minimization problem of RUDDER in more detail:
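As a complementary illustration, here is a minimal sketch of a RUDDER-style return-decomposition network and reward redistribution in the spirit of [2, 3]. The class `ReturnDecomposition`, the auxiliary-loss weighting, and all names are illustrative assumptions rather than the chapter's exact loss.

```python
import torch
import torch.nn as nn

class ReturnDecomposition(nn.Module):
    """Sketch of a RUDDER-style return decomposition network:
    an LSTM predicts the episodic return at every time step."""

    def __init__(self, state_action_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_action_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, seq):                      # seq: (batch, T, state_action_dim)
        h, _ = self.lstm(seq)
        return self.head(h).squeeze(-1)          # per-step return predictions g_0, ..., g_{T-1}

def rudder_loss_and_redistribution(model, seq, episode_return):
    g = model(seq)                               # (batch, T)
    # Main loss: the final prediction should match the episodic return.
    main_loss = ((g[:, -1] - episode_return) ** 2).mean()
    # Auxiliary loss: intermediate predictions also target the return (weight is illustrative).
    aux_loss = ((g - episode_return.unsqueeze(1)) ** 2).mean()
    # Redistributed rewards: differences of consecutive return predictions.
    redistributed = g[:, 1:] - g[:, :-1]
    return main_loss + 0.1 * aux_loss, redistributed
```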
1.3 A.3 Causality and Reward-To-Go
This section is meant to provide the reader with more details concerning the causality assumption that leads to the formula for h in Eq. (15) for PPO. We can derive a formulation of the policy gradient with reward-to-go. For ease of notation, instead of using \(\tilde{P_{\pi }}(\tau )\) as in previous sections, we here denote the probability of state-action sequence \(\tau =\tau _{0,T}=(s_0,a_0,s_1,a_1,\ldots ,s_T,a_T)\) with policy \(\pi \) as
The probability of state-action sequence \(\tau _{0,t}=(s_0,a_0,s_1,a_1,\ldots ,s_t,a_t)\) with policy \(\pi \) is
The probability of state-action sequence \(\tau _{t+1,T}=(s_{t+1},a_{t+1},\ldots ,s_T,a_T)\) with policy \(\pi \) given \(( s_t,a_t)\) is
The expectation of \(\sum _{t=0}^{T} R_{t+1}\) is
With \(R_{t+1} \sim p(r_{t+1} \mid s_t,a_t)\), the random variable \(R_{t+1}\) depends only on \((s_t,a_t)\). We define the expected reward \(\mathbf {\mathrm {E}}_{r_{t+1}} \left[ R_{t+1} \mid s_t,a_t\right] \) as a function \(r(s_t,a_t)\) of \((s_t,a_t)\):
Causality. We assume that the reward \(R_{t+1}=R(s_t,a_t) \sim p(r_{t+1} \mid s_t,a_t)\) only depends on the past but not on the future. The state-action pair \((s_t,a_t)\) is determined by the past and not by the future. What matters is only how likely we are to observe \((s_t,a_t)\), not what we do afterwards.
Causality is derived from the Markov property of the MDP and means:
That is
Policy Gradient Theorem. We now assume that the policy \(\pi \) is parametrized by \(\boldsymbol{\theta }\), that is, \(\pi (a_t \mid s_t) = \pi (a_t \mid s_t ; \boldsymbol{\theta })\). We need the gradient with respect to \(\boldsymbol{\theta }\) of \(\prod _{t=a}^{b} \pi (a_t \mid s_t)\):
It follows that
We only have to consider the reward to go. Since \(a_0\) does not depend on \(\pi \), we have \(\nabla _{\theta } \mathbf {\mathrm {E}}_{\pi } \left[ R_1 \right] =0\). Therefore
We can express this in terms of Q-values.
Finally, we have:
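For orientation, under the causality argument above and the convention that \(\nabla _{\boldsymbol{\theta }} \mathbf {\mathrm {E}}_{\pi } \left[ R_1 \right] = 0\), the resulting expression has the standard policy-gradient-theorem form (stated here in the notation of this section; the chapter's own equation may differ slightly in indexing):

$$\begin{aligned} \nabla _{\boldsymbol{\theta }} \, \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=0}^{T} R_{t+1} \right] \ = \ \mathbf {\mathrm {E}}_{\pi } \left[ \sum _{t=1}^{T} \nabla _{\boldsymbol{\theta }} \log \pi (a_t \mid s_t ; \boldsymbol{\theta }) \ q^{\pi }(s_t,a_t) \right] . \end{aligned}$$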
1.4 A.4 Precise Statement of the Assumptions
Here we provide a precise formulation of the assumptions from [16]. The formulation we use here is mostly taken from [14]:
- (A1) Assumptions on the controlled Markov processes: The controlled Markov process \(\boldsymbol{z}\) takes values in a compact metric space S. It is controlled by the iterate sequences \(\{\boldsymbol{\theta }_n\}\) and \(\{\boldsymbol{\omega }_n\}\), and furthermore \(\boldsymbol{z}_n\) is controlled by a random process \(\boldsymbol{a}_n\) taking values in a compact metric space W. For B Borel in S, the dynamics of \(\boldsymbol{z}_n\) for \(n\geqslant 0\) is determined by a transition kernel \(\tilde{p}\):
$$\begin{aligned}&\mathrm {P}(\boldsymbol{z}_{n+1} \in B \mid \boldsymbol{z}_l, \boldsymbol{a}_l, \boldsymbol{\theta }_l, \boldsymbol{\omega }_l, l\leqslant n) = \ \int _{B} \tilde{p}(\mathrm {d}\boldsymbol{z}\mid \boldsymbol{z}_n, \boldsymbol{a}_n, \boldsymbol{\theta }_n, \boldsymbol{\omega }_n). \end{aligned}$$ (39)
- (A2) Assumptions on the update functions: \(h : \mathbb {R}^{m+k} \times S^{(1)} \rightarrow \mathbb {R}^m\) is jointly continuous as well as Lipschitz in its first two arguments, uniformly w.r.t. the third. This means that for all \( \boldsymbol{z}\in S\):
$$\begin{aligned} \Vert h(\boldsymbol{\theta }, \boldsymbol{\omega }, \boldsymbol{z}) \ - \ h(\boldsymbol{\theta }', \boldsymbol{\omega }', \boldsymbol{z})\Vert \ \leqslant \ L^{(1)} \ (\Vert \boldsymbol{\theta }-\boldsymbol{\theta }'\Vert + \Vert \boldsymbol{\omega }- \boldsymbol{\omega }'\Vert ). \end{aligned}$$ (40)
Similarly for f, where the Lipschitz constant is \(L^{(2)}\).
- (A3) Assumptions on the additive noise: For \(i=1,2\), \(\{(\boldsymbol{m}_i)_n\}\) are martingale difference sequences with bounded second moments. More precisely, \((\boldsymbol{m}_i)_n\) are martingale difference sequences w.r.t. the increasing \(\sigma \)-fields
$$\begin{aligned} \mathfrak {F}_n \ = \ \sigma (\boldsymbol{\theta }_l, \boldsymbol{\omega }_l, (\boldsymbol{m}_1)_{l}, (\boldsymbol{m}_2)_{l}, \boldsymbol{z}_l, l \leqslant n) , \end{aligned}$$ (41)
satisfying \( \mathrm {E}\left[ \Vert (\boldsymbol{m}_i)_n \Vert ^2 \mid \mathfrak {F}_n \right] \ \leqslant \ B_i \) for \(n \geqslant 0\) and given constants \(B_i\).
- (A4) Assumptions on the learning rates:
$$\begin{aligned}&\sum _{n} a(n) \ = \ \infty , \quad \sum _{n} a^2(n) \ < \ \infty , \end{aligned}$$ (42)
$$\begin{aligned}&\sum _{n} b(n) \ = \ \infty , \quad \sum _{n} b^2(n) \ < \ \infty , \end{aligned}$$ (43)
and \(a(n) \ = \ \mathrm {o}(b(n))\). Furthermore, \(a(n), b(n), n \geqslant 0\) are non-increasing.
- (A5) Assumptions on the transition kernels: The state-action map
$$\begin{aligned} S \times W \times \mathbb {R}^{m+k} \ni \ (\boldsymbol{z},\boldsymbol{a},\boldsymbol{\theta },\boldsymbol{\omega }) \ \mapsto \ \tilde{p}(\mathrm {d}\boldsymbol{y}\mid \boldsymbol{z}, \boldsymbol{a}, \boldsymbol{\theta }, \boldsymbol{\omega }) \end{aligned}$$ (44)
is continuous (the topology on the spaces of probability measures is induced by weak convergence).
- (A6) Assumptions on the associated ODEs: We consider occupation measures, which intuitively give, for the controlled Markov process, the probability or density of observing a particular state-action pair from \(S \times W\) for given \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) and a given control. A precise definition of these occupation measures can be found e.g. on page 68 of [7] or page 5 of [16]. We make the following assumptions:
  - We assume that there exists only one such ergodic occupation measure for \(\boldsymbol{z}_n\) on \(S \times W\), denoted by \(\varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}\). A main reason for assuming uniqueness is that it enables us to deal with ODEs instead of differential inclusions. Moreover, set
$$\begin{aligned} \tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega }) \ = \ \int f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\omega }}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$ (45)
  - We assume that for \( \boldsymbol{\theta }\in \mathbb {R}^m\), the ODE \( \dot{\boldsymbol{\omega }}(t) \ = \ \tilde{f}(\boldsymbol{\theta },\boldsymbol{\omega }(t)) \) has a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) with attractor set \(B_{\boldsymbol{\theta }}\) such that \(\boldsymbol{\lambda }: \mathbb {R}^m \rightarrow \mathbb {R}^k\) is a Lipschitz map with global Lipschitz constant.
  - The Lyapunov function \(V(\boldsymbol{\theta },\cdot )\) associated to \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is continuously differentiable.
  - Next define
$$\begin{aligned} \tilde{h}(\boldsymbol{\theta }) \ = \ \int h(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }),\boldsymbol{z}) \ \varGamma _{\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta })}(\mathrm {d}\boldsymbol{z}, W). \end{aligned}$$ (46)
We assume that the ODE \( \dot{\boldsymbol{\theta }}(t) \ = \ \tilde{h}(\boldsymbol{\theta }(t)) \) has a global attractor set A.
  - For all \(\boldsymbol{\theta }\), with probability 1, \(\boldsymbol{\omega }_n\) for \(n\geqslant 1\) belongs to a compact subset \(Q_{\boldsymbol{\theta }}\) of \(B_{\boldsymbol{\theta }}\) “eventually”.
This assumption is an adapted version of (A6)’ of [16]; we adapt it to avoid too many technicalities (e.g. in [16] two controls are used, which we avoid here in order not to overload notation).
- (A7) Assumption of bounded iterates: \(\sup _n \Vert \boldsymbol{\theta }_n \Vert \ < \ \infty \) and \(\sup _n \Vert \boldsymbol{\omega }_n \Vert \ < \ \infty \) a.s.
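For orientation, the generic coupled iteration to which assumptions (A1)–(A7) refer can be sketched as follows. Here `h`, `f`, `sample_z`, the step-size exponents, and the Gaussian martingale noise are illustrative placeholders, not the chapter's concrete PPO or RUDDER quantities.

```python
import numpy as np

def two_timescale(h, f, sample_z, theta0, omega0, n_steps=10_000,
                  noise_scale=0.01, seed=0):
    """Schematic two time-scale stochastic approximation:
        theta_{n+1} = theta_n + a(n) * (h(theta_n, omega_n, z_n) + (m_1)_{n+1})
        omega_{n+1} = omega_n + b(n) * (f(theta_n, omega_n, z_n) + (m_2)_{n+1})
    with a(n) = o(b(n)), so theta evolves on the slow and omega on the fast scale."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    omega = np.asarray(omega0, dtype=float)
    z = sample_z(theta, omega, None)                 # initial state of the controlled Markov noise (A1)
    for n in range(1, n_steps + 1):
        a_n = n ** (-1.0)                            # slow step size, satisfies (A4)
        b_n = n ** (-2.0 / 3.0)                      # fast step size, a(n)/b(n) -> 0
        m1 = noise_scale * rng.standard_normal(theta.shape)   # martingale noise (A3)
        m2 = noise_scale * rng.standard_normal(omega.shape)
        new_theta = theta + a_n * (h(theta, omega, z) + m1)
        omega = omega + b_n * (f(theta, omega, z) + m2)
        theta = new_theta
        z = sample_z(theta, omega, z)                # next state of the controlled Markov process
    return theta, omega
```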
1.5 A.5 Further Details concerning the Sampling Process
Let us formulate the construction of the sampling process in more detail: We introduced the function \(S_{\pi }\) in the main paper as follows:
Now \(S_{\pi }\) basically divides the interval [0, 1] into finitely many disjoint subintervals, such that the i-th subinterval \(I_i\) maps to the i-th element \(\tau _i \in \tilde{\varOmega }_{\pi }\) and the length of \(I_i\) is given by \(\tilde{P_{\pi }}(\tau _i)\). \(S_{\pi }\) is measurable, because the pre-image under \(S_{\pi }\) of any element of the sigma-algebra \(\tilde{\mathfrak {A}_{\pi }}\) is just a finite union of subintervals of [0, 1], which is clearly contained in the Borel algebra. Basically, \(S_{\pi }\) describes how to obtain one sample from a multinomial distribution with (finitely many) probabilities \(\tilde{P_{\pi }}(\tau )\), where \(\tau \in \tilde{\varOmega }_{\pi }\). Compare with inverse transform sampling, e.g. Theorem 2.1.10 in [9] and applications thereof. For the reader’s convenience, let us briefly recall this important concept in a formal way:
Lemma 1 (Inverse transform sampling)
Let X have continuous cumulative distribution \(F_X(x)\) and define the random variable Y as \(Y=F_{X}(X)\). Then Y is uniformly distributed on (0, 1).
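A small numerical sketch of this construction, assuming a finite set of trajectories with probabilities \(\tilde{P_{\pi }}(\tau _i)\); the probability values and function names below are illustrative.

```python
import numpy as np

def make_S_pi(probs):
    """S_pi: [0, 1] -> {0, ..., N-1}; the i-th subinterval I_i of length
    probs[i] is mapped to the index of trajectory tau_i."""
    cum = np.cumsum(probs)                      # right endpoints of the intervals I_i
    def S_pi(u):
        return int(np.searchsorted(cum, u, side="left"))
    return S_pi

# Example with three trajectories of probabilities 0.5, 0.3, 0.2.
probs = np.array([0.5, 0.3, 0.2])
S_pi = make_S_pi(probs)
rng = np.random.default_rng(0)
samples = [S_pi(rng.uniform()) for _ in range(100_000)]
# Empirical frequencies approximate probs, as inverse transform sampling predicts.
print(np.bincount(samples, minlength=len(probs)) / len(samples))
```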
1.6 A.6 Further Details for Proof of Theorem 1
Here we provide further technical details needed to verify the assumptions stated above and thus to prove our main result, Theorem 1.
Ad (A1): Assumptions on the Controlled Markov Processes: Let us start by discussing more details for the controlled processes that appear in the PPO and RUDDER setting. Let us focus on the process related to \((z_1)_n\): Let \(\beta >1\) and let the real sequence \((z_1)_n\) be defined by \((z_1)_1=1\) and \((z_1)_{n+1}=(1-\frac{1}{\beta })(z_1)_{n}+1\). The \((z_1)_n\) are nothing more than the partial sums of a geometric series converging to \(\beta \).
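Spelling this out (a one-line check, not part of the original argument), induction on the recursion gives

$$\begin{aligned} (z_1)_n \ = \ \sum _{k=0}^{n-1} \left( 1-\frac{1}{\beta } \right) ^{k} \ = \ \beta \left( 1 - \left( 1-\frac{1}{\beta } \right) ^{n} \right) \ \longrightarrow \ \beta \quad \text {as } n \rightarrow \infty , \end{aligned}$$

since \(0< 1-\frac{1}{\beta } < 1\) for \(\beta > 1\).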
The sequence \((z_1)_n\) can also be interpreted as a time-homogeneous Markov process \((\boldsymbol{z}_1)_n\) with transition probabilities given by
where \(\delta \) denotes the Dirac measure, and with the compact interval \([1,\beta ]\) as its range. We use the standard notation for discrete time Markov processes, described in detail e.g. in [13]. Its unique invariant measure is clearly \(\delta _{\beta }\). So integrating wrt. this invariant measure will in our case just correspond to taking the limit \((z_1)_n \rightarrow \beta \).
Ad (A2): \(\boldsymbol{h}\) and \(\boldsymbol{f}\) are Lipschitz: By the mean value theorem it is enough to show that the derivatives wrt. \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) are bounded uniformly wrt. \(\boldsymbol{z}\). We only show details for f, since for h similar considerations apply. By the explicit formula for \(L_g\), we see that \(f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\) can be written as:
The claim can now be readily deduced from the assumptions (L1)–(L3).
Ad (A3): Martingale Difference Property and Estimates: From the results in the main paper on the probabilistic setting, \((\boldsymbol{m}_1)_{n+1}\) and \((\boldsymbol{m}_2)_{n+1}\) are easily seen to be martingale difference sequences with respect to their filtrations \(\mathfrak {F}_n\). Indeed, the sigma-algebras generated by \(\boldsymbol{\omega }_n\) and \(\boldsymbol{\theta }_n\) already determine \(\tilde{\mathfrak {A}}_{\pi _{\boldsymbol{\theta }_n}}\), and thus:
It remains to show that
This, however, is also clear, since all the involved expressions are uniformly bounded, again by the assumptions (L1)–(L3) on the losses (e.g. one can see this by writing out the involved expressions explicitly, as indicated in the previous point (A2)).
Ad (A4): Assumptions on the Learning Rates: These standard assumptions are taken for granted.
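As a concrete illustration (our own example, not one prescribed by the chapter), the choice

$$\begin{aligned} a(n) \ = \ (n+1)^{-1}, \qquad b(n) \ = \ (n+1)^{-2/3} \end{aligned}$$

satisfies (A4): both series diverge, both squared series converge, \(a(n)/b(n) = (n+1)^{-1/3} \rightarrow 0\), so \(a(n) = \mathrm {o}(b(n))\), and both sequences are non-increasing.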
Ad (A5): Transition Kernels: The continuity of the transition kernels is clear from Eq. (48) (continuity is w.r.t. the weak topology on the space of probability measures, so in our case this again boils down to using continuity of the test functions).
Ad (A6): Stability Properties of the ODEs:
- In the explanations for (A1) we mentioned that integrating w.r.t. the ergodic occupation measure corresponds in our case to taking the limit \(\boldsymbol{z}_n \rightarrow \boldsymbol{z}\) (since our Markov processes can be interpreted as sequences). Thus \(\tilde{f}(\boldsymbol{\theta }, \boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\). In the sequel we will also use the following abbreviations: \(f(\boldsymbol{\theta },\boldsymbol{\omega })=f(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), \(h(\boldsymbol{\theta },\boldsymbol{\omega })=h(\boldsymbol{\theta },\boldsymbol{\omega },\boldsymbol{z})\), etc. Now consider the ODE
$$\begin{aligned} \dot{\boldsymbol{\omega }}(t)=f(\boldsymbol{\theta },\boldsymbol{\omega }(t)), \end{aligned}$$ (52)
where \(\boldsymbol{\theta }\) is fixed. Equation (52) can be seen as a gradient system for the function \(L_g\). By standard results on gradient systems (cf. e.g. Sect. 4 in [1] for a nice summary), which guarantee equivalence between strict local minima of the loss function and asymptotically stable points of the associated gradient system, we can use the assumptions (L1)–(L3) and the remarks thereafter in the main paper to ensure that there exists a unique asymptotically stable equilibrium \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) of Eq. (52).
- The fact that \(\boldsymbol{\lambda }(\boldsymbol{\theta })\) is smooth enough can be deduced from the Implicit Function Theorem, as discussed in the main paper.
- For Eq. (52), \(L_g(\boldsymbol{\theta },\boldsymbol{\omega })-L_g(\boldsymbol{\theta },\boldsymbol{\lambda }(\boldsymbol{\theta }))\) can be taken as the associated Lyapunov function \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\), and thus \(V_{\boldsymbol{\theta }}(\boldsymbol{\omega })\) is clearly differentiable w.r.t. \(\boldsymbol{\omega }\) for any \(\boldsymbol{\theta }\) (see the short computation after this list).
- The slow ODE \( \dot{\boldsymbol{\theta }}(t)=h(\boldsymbol{\theta }(t),\boldsymbol{\lambda }(\boldsymbol{\theta }(t))) \) also has a unique asymptotically stable fixed point, which again is guaranteed by our assumptions and the standard results on gradient systems.
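Spelling out the Lyapunov property from the third bullet above (a one-line check under the gradient-system reading \(f = -\nabla _{\boldsymbol{\omega }} L_g\) of Eq. (52)):

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} V_{\boldsymbol{\theta }}(\boldsymbol{\omega }(t)) \ = \ \nabla _{\boldsymbol{\omega }} L_g(\boldsymbol{\theta },\boldsymbol{\omega }(t))^{\top } \dot{\boldsymbol{\omega }}(t) \ = \ - \left\| \nabla _{\boldsymbol{\omega }} L_g(\boldsymbol{\theta },\boldsymbol{\omega }(t)) \right\| ^{2} \ \leqslant \ 0 , \end{aligned}$$

with equality exactly at critical points, so \(V_{\boldsymbol{\theta }}\) decreases along trajectories of Eq. (52) and vanishes at \(\boldsymbol{\lambda }(\boldsymbol{\theta })\).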
Ad (A7): Assumption of Bounded Iterates: This follows from the assumptions on the loss functions.
1.7 A.7 Finite Greediness is Sufficient to Converge to the Optimal Policy
Here we provide details on how the optimal policy can be deduced using only a finite parameter \(\beta >1\). The Q-values for policy \(\pi \) are:
The optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Then we are going to show the following result:
Lemma 2
Let \(i_{\max }= \arg \max _{i} q^{\pi ^*}(s,a^i)\) and \(v^{\pi ^*}(s) = \max _{i} q^{\pi ^*}(s,a^i)\). We define
We assume a function \(\psi (s,a^i)\) that defines the actual policy \(\pi \) via
We assume that the function \(\psi \) has already identified the optimal actions, which occurs at some point during learning as the policy becomes greedier:
Hence,
We assume that
Then, for all s, we can state:
therefore the Q-values \(q^{\pi }(s,a^i)\) determine the optimal policy, since the action with the largest Q-value can be chosen.
More importantly, \(\beta \) is large enough to allow Q-value-based methods and policy gradients to converge to the optimal policy if it is a local minimum of the loss functions. For Q-value-based methods, the optimal action can be determined if the optimal policy is the minimum of the loss functions. For policy gradients, the optimal action always receives the largest gradient, and the policy converges to the optimal policy.
Proof
We already discussed that the optimal policy \(\pi ^*\) is known to be deterministic \(\left( \prod _{t=1}^T \pi ^*(a_t\ |\ s_t) \in \{0,1\} \right) \). Let us assume that the optimal policy is also unique. Since
we have
and for \(i \not = i_{\max }\)
For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 1\), we have
where in the last step we used that \((|\mathscr {A}|-1) \exp (- \beta \delta )<1\) by definition of \(\beta \) in (58) so that an application of Bernoulli’s inequality is justified. For \(\prod _{t=1}^{T} \pi ^*(a_t \mid s_t) = 0\), we have
Therefore
Using Eq. (65) and the definition of \(\beta \) in Eq. (58) we get:
Now from the condition that \(q^{\pi }(s,a^i) \ > \ q^{\pi }(s,a^j) \) for all \(j \ne i\) we can conclude that
for all \(j \ne i\). Thus for \(j \not =i\) it follows that \(j \not =i_{\max }\) and consequently \(i=i_{\max }\). \(\square \)
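A small numerical illustration of Lemma 2's message (with made-up preference values \(\psi (s,a^i)\) and gap \(\delta \), not taken from the chapter): for a softmax policy \(\pi (a^i \mid s) \propto \exp (\beta \, \psi (s,a^i))\) whose preferences already rank the optimal action highest, a finite \(\beta \) already puts almost all probability mass on that action.

```python
import numpy as np

def softmax_policy(psi, beta):
    """Softmax policy pi(a^i | s) proportional to exp(beta * psi(s, a^i))."""
    z = beta * (psi - psi.max())                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical preferences for |A| = 4 actions; the optimal action (index 0)
# leads the runner-up by a gap delta.
psi = np.array([1.0, 0.5, 0.3, 0.1])
delta = psi[0] - np.sort(psi)[-2]
n_actions = len(psi)

for beta in [1.0, 5.0, 20.0]:
    pi = softmax_policy(psi, beta)
    # Lower bound 1 / (1 + (|A|-1) * exp(-beta * delta)) on the optimal action's probability.
    bound = 1.0 / (1.0 + (n_actions - 1) * np.exp(-beta * delta))
    print(f"beta={beta:5.1f}  pi(optimal)={pi[0]:.4f}  >=  {bound:.4f}")
```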
Copyright information
© 2021 Springer-Verlag GmbH Germany, part of Springer Nature
Cite this chapter
Holzleitner, M., Gruber, L., Arjona-Medina, J., Brandstetter, J., Hochreiter, S. (2021). Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVIII. Lecture Notes in Computer Science, vol. 12670. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-63519-3_5
DOI: https://doi.org/10.1007/978-3-662-63519-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-63518-6
Online ISBN: 978-3-662-63519-3